ETAPS Foreword


Welcome to the 23rd ETAPS! This is the first time that ETAPS took place in Ireland,
in its beautiful capital, Dublin.

ETAPS 2020 was the 23rd instance of the European Joint Conferences on Theory 
and Practice of Software. ETAPS is an annual federated conference established in 
1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from 
theoretical computer science to foundations of programming language developments, 
analysis tools, and formal approaches to software engineering. Organizing these 
conferences in a coherent, highly synchronized conference program enables researchers 
to participate in an exciting event, having the possibility to meet many colleagues 
working in different directions in the field, and to easily attend talks of different 
conferences. On the weekend before the main conference, numerous satellite 
workshops took place that attracted many researchers from all over the globe. Also, for 
the second time, an ETAPS Mentoring Workshop was organized. This workshop is 
intended to help students early in the program with advice on research, career, and life 
in the fields of computing that are covered by the ETAPS conference. 

ETAPS 2020 received 424 submissions in total, 129 of which were accepted, 
yielding an overall acceptance rate of 30.4%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their 
contributions, and in particular the PC (co-)chairs for their hard work in running this 
entire intensive process. Last but not least, my congratulations to all authors of the 
accepted papers! 

ETAPS 2020 featured the unifying invited speakers Scott Smolka (Stony Brook 
University) and Jane Hillston (University of Edinburgh) and the conference-specific 
invited speakers (ESOP) Isil Dillig (University of Texas at Austin) and (FASE) Willem 
Visser (Stellenbosch University). Invited tutorials were provided by Erika Abraham 
(RWTH Aachen University) on the analysis of hybrid systems and Madhusudan 
Parthasarathy (University of Illinois at Urbana-Champaign) on combining Machine 
Learning and Formal Methods. On behalf of the ETAPS 2020 attendees, I thank all the
speakers for their inspiring and interesting talks! 

ETAPS 2020 took place in Dublin, Ireland, and was organized by the University of 
Limerick and Lero. ETAPS 2020 is further supported by the following associations and 
societies: ETAPS e.V., EATCS (European Association for Theoretical Computer 
Science), EAPLS (European Association for Programming Languages and Systems), 
and EASST (European Association of Software Science and Technology). The local 
organization team consisted of Tiziana Margaria (general chair, UL and Lero), 
Vasileios Koutavas (Lero@UCD), Anila Mjeda (Lero@UL), Anthony Ventresque 
(Lero@UCD), and Petros Stratis (Easy Conferences).

The ETAPS Steering Committee (SC) consists of an Executive Board, and
representatives of the individual ETAPS conferences, as well as representatives of 
EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns 
(Saarbrücken), Marieke Huisman (chair, Twente), Joost-Pieter Katoen (Aachen and 
Twente), Jan Kofron (Prague), Gerald Lüttgen (Bamberg), Tarmo Uustalu (Reykjavik
and Tallinn), Caterina Urban (Inria, Paris), and Lenore Zuck (Chicago). 

Other members of the SC are: Armin Biere (Linz), Jordi Cabot (Barcelona), Jean 
Goubault-Larrecq (Cachan), Jan-Friso Groote (Eindhoven), Esther Guerra (Madrid), 
Jurriaan Hage (Utrecht), Reiko Heckel (Leicester), Panagiotis Katsaros (Thessaloniki), 
Stefan Kiefer (Oxford), Barbara König (Duisburg), Fabrice Kordon (Paris), Jan
Kretinsky (Munich), Kim G. Larsen (Aalborg), Tiziana Margaria (Limerick), Peter 
Müller (Zurich), Catuscia Palamidessi (Palaiseau), Dave Parker (Birmingham), 
Andrew M. Pitts (Cambridge), Peter Ryan (Luxembourg), Don Sannella (Edinburgh), 
Bernhard Steffen (Dortmund), Mariëlle Stoelinga (Twente), Gabriele Taentzer
(Marburg), Christine Tasson (Paris), Peter Thiemann (Freiburg), Jan Vitek (Prague), 
Heike Wehrheim (Paderborn), Anton Wijs (Eindhoven), and Nobuko Yoshida 
(London). 

I would like to take this opportunity to thank all speakers, attendees, organizers
of the satellite workshops, and Springer for their support. I hope you all enjoyed 
ETAPS 2020. Finally, a big thanks to Tiziana and her local organization team for all 
their enormous efforts enabling a fantastic ETAPS in Dublin! 


February 2020 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


TACAS 2020 was the 26th edition of the International Conference on Tools and 
Algorithms for the Construction and Analysis of Systems conference series. TACAS 
2020 was part of the 23rd European Joint Conferences on Theory and Practice of 
Software (ETAPS 2020). The conference was held at the Royal Marine Hotel in 
Dublin, Ireland, during April 25-30, 2020. 

TACAS is a forum for researchers, developers, and users interested in rigorously 
based tools and algorithms for the construction and analysis of systems. The conference 
aims to bridge the gaps between different communities with this common interest and 
to support them in their quest to improve the utility, reliability, flexibility, and effi- 
ciency of tools and algorithms for building systems. TACAS solicited four types of 
submissions: 


— Research papers advancing the theoretical foundations for the construction and 
analysis of systems 

— Case study papers with an emphasis on a real-world setting 

— Regular tool papers presenting a new tool, a new tool component, or novel 
extensions to an existing tool and requiring an artifact submission 

— Tool demonstration papers focusing on the usage aspects of tools, also subject to 
the artifact submission requirement 


This year 155 papers were submitted to TACAS, consisting of 111 research papers, 
8 case study papers, 19 regular tool papers, and 17 tool demo papers. Individual authors 
were limited to a maximum of three submissions. Each paper was reviewed by at least 
three Program Committee (PC) members, who also provided feedback on whether certain
papers should go through a rebuttal process.

The chairs asked for 59 rebuttals, usually following such rebuttal recommendations 
by PC members. In parallel to PC reviewing, the Artifact Evaluation Committee 
(AEC) reviewed the artifacts. A formal summary review of this evaluation was made 
available to the PC members and taken into account in the discussion phase. The case 
study chair and the tools chair made sure that identical reviewing and selection criteria 
were applied within their respective class of papers. After this thorough reviewing, 
rebuttal and discussion phase, a total of 48 papers were accepted, including 31 research 
papers, 4 case study papers, 5 regular tool papers and 8 tool demo papers. 

As in 2019, TACAS 2020 included an artifact evaluation (AE) for all types of 
papers. There were two rounds of the AE: for regular tool papers and tool demon- 
stration papers AE was compulsory and artifacts had to be submitted to the first round. 
For research and case study papers, it was voluntary, and artifacts could be submitted to 
either the first or the second round. The results of the first round were communicated to 
the TACAS PC before their discussion phase so that the quality of the artifact could be
considered prior to the TACAS decision making. Each artifact was evaluated inde- 
pendently by at least three reviewers. All accepted papers with accepted artifacts 
received a badge which is added to the title page of the respective paper if desired by 
the authors. 

The AEC used a two-phase reviewing process: reviewers first performed an initial 
check to see whether the artifact was technically usable and whether the accompanying 
instructions were consistent, followed by a full evaluation of the artifact. The main
criterion for artifact acceptance was consistency with the paper, with completeness and
documentation being handled in a more lenient manner as long as the artifact was
useful overall.

In the first round, out of 44 artifact submissions, 29 were accepted and 15 were 
rejected. This corresponds to an acceptance rate of 66%. Out of the 36 artifacts for
regular tool papers and tool demonstration papers, 25 artifacts were accepted and 
11 artifacts were rejected resulting in an acceptance rate of 69%. In all but five cases, 
tool papers whose artifacts did not pass the evaluation were rejected. Those 5 artifacts 
were invited for submission in the second evaluation round and 3 of these artifacts were 
resubmitted and successfully evaluated. Overall, out of the 20 artifacts submitted to the 
second evaluation round, 17 were accepted and 3 were rejected resulting in an 
acceptance rate of 85%. 

TACAS 2020 also hosted the 9th International Competition on Software Verification
(SV-COMP 2020), chaired and organized by Dirk Beyer. The competition again
had high participation: 28 verification systems with developers from 11 countries
were submitted for the systematic comparative evaluation, including 3 submissions
from industry. Six teams contributed validators for verification witnesses. The TACAS
proceedings include the competition report and short papers describing 11 of the
participating verification systems. These papers were reviewed by a separate 
SV-COMP program committee; each of the papers was assessed by at least three 
reviewers. Two sessions in the TACAS program were reserved for the presentation 
of the results: the summary by the SV-COMP chair and the participating tools by the 
developer teams in the first session, and the open community meeting in the second 
session. 

We are grateful to everyone who helped to make TACAS 2020 a success. In 
particular, we would like to thank all PC members, external reviewers, and the 
members of the AEC for their detailed and informed reviews and for their discussions 
during the virtual PC and AEC meetings. The collection and selection of papers was 
organized through the EasyChair Conference System and the proceedings volumes 
were published with the help of Springer; we thank them all for their assistance. We 
also thank the SC for their advice, the Organizing Committee of ETAPS 2020 and its
general chair (Tiziana Margaria), and the chair of the ETAPS Executive Board (Marieke
Huisman).


Abstract. Branching bisimilarity is a behavioural equivalence relation
on labelled transition systems (LTSs) that takes internal actions into 
account. It has the traditional advantage that algorithms for branch- 
ing bisimilarity are more efficient than ones for other weak behavioural 
equivalences, especially weak bisimilarity. With m the number of tran- 
sitions and n the number of states, the classic O(mn) algorithm was 
recently replaced by an O(m(log |Act| + log n)) algorithm [9], which is
unfortunately rather complex. This paper combines its ideas with the 
ideas from Valmari [20], resulting in a simpler O(m log n) algorithm. 
Benchmarks show that in practice this algorithm is also faster and of- 
ten far more memory efficient than its predecessors, making it the best 
option for branching bisimulation minimisation and preprocessing for 
calculating other weak equivalences on LTSs. 


Keywords: Branching bisimilarity · Algorithm · Labelled transition
systems 


1 Introduction 


Branching bisimilarity [8] is an alternative to weak bisimilarity [17]. Both
equivalences allow the reduction of labelled transition systems (LTSs) containing
transitions labelled with internal actions, also known as silent, hidden or $\tau$-actions.

One of the distinct advantages of branching bisimilarity is that, from the
outset, an efficient algorithm has been available [10], which can be used to
calculate whether two states are equivalent and to calculate a quotient LTS. It has
complexity O(mn) with m the number of transitions and n the number of states.
It is more efficient than classic algorithms for weak bisimilarity, which use
transitive closure (for instance, [16] runs in $O(n^2 m \log n + m n^{2.376})$, where
$n^{2.376}$ is the time for computing the transitive closure), and algorithms for weak
simulation equivalence (strong simulation equivalence can be computed in O(mn) [12],
and for weak simulation equivalence first the transitive closure needs to be
computed). The algorithm is also far more efficient than algorithms for trace-based
equivalence notions, such as (weak) trace equivalence or weak failure
equivalence [16].

Branching bisimilarity also enjoys the nice mathematical property that there 
exists a canonical quotient with a minimal number of states and transitions 
(contrary to, for instance, trace-based equivalences). Additionally, as branching 
bisimilarity is coarser than virtually any other behavioural equivalence taking 
internal actions into account [7], it is ideal for preprocessing. In order to calcu- 
late a desired equivalence, one can first reduce the behaviour modulo branching 
bisimilarity, before applying a dedicated algorithm on the often substantially 
reduced transition system. In the mCRL2 toolset [5] this is common practice. 

In [9,11] an algorithm to calculate stuttering equivalence on Kripke
structures with complexity O(m log n) was proposed. Stuttering equivalence
essentially differs from branching bisimilarity in the fact that transitions do not have
labels and as such all transitions can be viewed as internal. In these papers it
was shown that branching bisimilarity can be calculated by translating LTSs to
Kripke structures, encoding the labels of transitions into labelled states
following [6,19]. This led to an O(m(log |Act| + log n)) or O(m log m) algorithm for
branching bisimilarity.

Besides the time complexity, the algorithm in [9,11] has two disadvantages. 
First, the translation to Kripke structures introduces a new state and a new 
transition per action label and target state of a transition, which increases the 
memory required to calculate branching bisimilarity. This made it far less mem- 
ory efficient than the classical algorithm of [10], and this was perceived as a 
substantial practical hindrance. For instance, when reducing systems consisting 
of tens of millions of states, such as [2], memory consumption is the bottleneck. 
Second, the algorithm in [9,11] is very complex. To illustrate the complexity, 
implementing it took approximately half a person-year. 


Contributions. We present an algorithm for branching bisimilarity that runs
directly on LTSs in O(m log n) time and that is simpler than the algorithm
of [9,11]. To achieve this we use an idea from Valmari and Lehtinen [20,21]
for strong bisimilarity. The standard Paige-Tarjan algorithm [18], which has
O(m log n) time complexity for strong bisimilarity on Kripke structures, registers
work done in a separate partition of states. Valmari [20] observed that this leads
to complexity O(m log m) on LTSs and proposed to use a partition of transitions,
whose elements he (and we) call bunches, to register work done. This reduces
the time complexity on LTSs to O(m log n).
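
How these bounds relate can be made explicit with a short calculation. It is a sketch that uses only the fact that the transition relation is a subset of $S \times Act \times S$, and it assumes $n \ge 2$:

\[
m \le |Act| \cdot n^2
\quad\Longrightarrow\quad
\log m \le \log |Act| + 2 \log n .
\]

Consequently, O(m log m) is contained in O(m(log |Act| + log n)). When |Act| is bounded by a constant, O(m log m) and O(m log n) coincide, so the improvement to O(m log n) is asymptotically relevant only for LTSs with many distinct action labels.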

Using this idea we design our more straightforward algorithm for branching
bisimilarity on LTSs. Essentially, this makes the maintenance of action labels
particularly straightforward and allows us to simplify the handling of so-called
new bottom states [10]. It also leads to a novel main invariant, which we formulate as
Invariant 1. It allows us to prove the correctness of the algorithm in a far more
straightforward way than before.

We have proven the correctness and complexity of the algorithm in detail [14]
and demonstrate that it outperforms all preceding algorithms both in time and
space when the LTSs are sizeable. This is illustrated with more than 30 example
LTSs. This shows that the new algorithm pushes the state-of-the-art in
comparing and minimising the behaviour of LTSs w.r.t. weak equivalences, either
directly (branching bisimilarity) or in the form of a preprocessing step (for
other weak equivalences).

Despite the fact that this new algorithm is more straightforward than the
previous O(m(log |Act| + log n)) algorithm [9], the implementation of the
algorithm is still not easy. To guard against implementation errors, we extensively
applied random testing, comparing the output with that of other algorithms. The
algorithms and their source code are freely available in the mCRL2 toolset [5].


Overview of the article. In Section 2 we provide the definition of LTSs and 
branching bisimilarity. In Section 3 we provide the core algorithm with high-level 
data structures, correctness and complexity. The subsequent section presents the 
procedure for splitting blocks, which can be presented as an independent pair 
of coroutines. Section 5 presents some benchmarks. Proofs and implementation 
details are omitted in this paper, and can be found in [14]. 


2 Branching bisimilarity 


In this section we define labelled transition systems and branching bisimilarity. 


Definition 1 (Labelled transition system). A labelled transition system
(LTS) is a triple $A = (S, Act, \to)$ where

1. $S$ is a finite set of states. The number of states is denoted by $n$.

2. $Act$ is a finite set of actions including the internal action $\tau$.

3. ${\to} \subseteq S \times Act \times S$ is a transition relation. The number of
transitions is necessarily finite and denoted by $m$.


It is common to write $t \xrightarrow{a} t'$ for $(t, a, t') \in {\to}$. With slight abuse of notation we
write $t \xrightarrow{a} t' \in T$ instead of $(t, a, t') \in T$ for $T \subseteq {\to}$. We also write
$t \xrightarrow{a} Z$ for the set of transitions $\{t \xrightarrow{a} t' \mid t' \in Z\}$, and $Z \xrightarrow{a} Z'$ for the set
$\{t \xrightarrow{a} t' \mid t \in Z, t' \in Z'\}$. We call all actions except $\tau$ the visible actions. If
$t \xrightarrow{a} t'$, we say that from $t$, the state $t'$, the action $a$, and the transition
$t \xrightarrow{a} t'$ are reachable.
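
To make the notation concrete, the following minimal Python sketch stores an LTS as a set of (source, action, target) triples. The names `TAU`, `LTS`, `make_lts`, `step` and `visible_actions` are illustrative assumptions and are not taken from the paper or from mCRL2.

```python
from typing import NamedTuple

TAU = "tau"  # assumed encoding of the internal action


class LTS(NamedTuple):
    states: frozenset       # S
    actions: frozenset      # Act, always containing TAU
    transitions: frozenset  # subset of S x Act x S


def make_lts(transitions):
    """Build an LTS whose states and actions are those mentioned in the triples."""
    transitions = frozenset(transitions)
    states = frozenset(s for (s, _, _) in transitions) | frozenset(
        t for (_, _, t) in transitions)
    actions = frozenset(a for (_, a, _) in transitions) | {TAU}
    return LTS(states, actions, transitions)


def step(lts, sources, action, targets):
    """The set Z -a-> Z': all `action`-transitions from `sources` into `targets`."""
    return {(s, a, t) for (s, a, t) in lts.transitions
            if s in sources and a == action and t in targets}


def visible_actions(lts):
    """All actions except the internal action."""
    return lts.actions - {TAU}


example = make_lts([(0, TAU, 1), (0, "a", 2), (1, "a", 2), (2, "b", 0)])
```

Here `len(example.states)` is n and `len(example.transitions)` is m.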


Definition 2 (Branching bisimilarity). Let $A = (S, Act, \to)$ be an LTS. We
call a relation $R \subseteq S \times S$ a branching bisimulation relation iff it is symmetric
and for all $s, t \in S$ such that $s \mathrel{R} t$ and all transitions $s \xrightarrow{a} s'$ we have:

1. $a = \tau$ and $s' \mathrel{R} t$, or

2. there is a sequence $t \xrightarrow{\tau} \cdots \xrightarrow{\tau} t' \xrightarrow{a} t''$ such that $s \mathrel{R} t'$ and $s' \mathrel{R} t''$.

Two states $s$ and $t$ are branching bisimilar, denoted by $s \mathrel{\underline{\leftrightarrow}_b} t$, iff there is a
branching bisimulation relation $R$ such that $s \mathrel{R} t$.
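
As an illustration of Definition 2, the sketch below checks whether a given relation, represented as a set of state pairs, is a branching bisimulation relation. It is a direct, unoptimised transcription of the two clauses and is not the algorithm of this paper; `TAU`, `equivalence_from_blocks` and `is_branching_bisimulation` are hypothetical names.

```python
TAU = "tau"  # assumed encoding of the internal action


def equivalence_from_blocks(blocks):
    """All pairs (s, t) of states that lie in the same block."""
    return {(s, t) for block in blocks for s in block for t in block}


def is_branching_bisimulation(transitions, relation):
    """Check the two clauses of Definition 2 for every related pair."""
    rel = set(relation)
    if any((t, s) not in rel for (s, t) in rel):
        return False  # the relation must be symmetric
    succ = {}
    for (s, a, t) in transitions:
        succ.setdefault(s, []).append((a, t))

    def tau_closure(t):
        # States reachable from t by zero or more tau-transitions.
        seen, stack = {t}, [t]
        while stack:
            u = stack.pop()
            for (a, v) in succ.get(u, []):
                if a == TAU and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    for (s, t) in rel:
        for (a, s1) in succ.get(s, []):
            if a == TAU and (s1, t) in rel:
                continue  # clause 1
            # clause 2: t -tau-> ... -tau-> t1 -a-> t2 with s R t1 and s1 R t2
            if not any((s, t1) in rel and (s1, t2) in rel
                       for t1 in tau_closure(t)
                       for (b, t2) in succ.get(t1, []) if b == a):
                return False
    return True


transitions = {(0, TAU, 1), (1, "a", 2), (3, "a", 4)}
good = equivalence_from_blocks([{0, 1, 3}, {2, 4}])
bad = equivalence_from_blocks([{0, 1, 2, 3, 4}])
```

In this example `good` relates the three states that can eventually perform `a`, whereas `bad` also relates them to the deadlocked states and therefore fails clause 2.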


Note that branching bisimilarity is an equivalence relation. Given an equivalence
relation $R$, a transition $s \xrightarrow{a} t$ is called inert iff $a = \tau$ and $s \mathrel{R} t$. If
$t \xrightarrow{\tau} t_1 \xrightarrow{\tau} \cdots \xrightarrow{\tau} t_{n-1} \xrightarrow{\tau} t_n \xrightarrow{a} t'$ such that $t \mathrel{R} t_i$ for $1 \le i \le n$, we say that the
state $t_n$, the action $a$, and the transition $t_n \xrightarrow{a} t'$ are inertly reachable from $t$.

The equivalence classes of branching bisimilarity partition the set of states.


Definition 3 (Partition). For a set $X$ a partition $\Pi$ of $X$ is a disjoint cover
of $X$, i.e., $\Pi = \{B_i \subseteq X \mid B_i \neq \emptyset, 1 \le i \le k\}$ such that $B_i \cap B_j = \emptyset$ for all
$1 \le i < j \le k$ and $X = \bigcup_{1 \le i \le k} B_i$.

A partition $\Pi'$ is a refinement of $\Pi$ iff for every $B' \in \Pi'$ there is some
$B \in \Pi$ such that $B' \subseteq B$.


We will often use that a partition $\Pi$ induces an equivalence relation in the
following way: $s \equiv_{\Pi} t$ iff there is some $B \in \Pi$ containing both $s$ and $t$.


3 The algorithm 


In this section we present the core algorithm. In the next section we deal with 
the actual splitting of blocks in the partition. We start off with an abstract 
description of this core part. 


3.1 High-level description of the algorithm 


The algorithm is a partition refinement algorithm. It iteratively refines two
partitions $\Pi_s$ and $\Pi_t$. Partition $\Pi_s$ is a partition of states in $S$ that is coarser
than branching bisimilarity. We refer to the elements of $\Pi_s$ as blocks, typically
denoted using $B$. Partition $\Pi_t$ partitions the non-inert transitions of $\to$, where
inertness is interpreted with respect to $\equiv_{\Pi_s}$. We refer to the elements of $\Pi_t$ as
bunches, typically denoted using $T$.

The partition of transitions $\Pi_t$ records the current knowledge about
transitions. Transitions are in different bunches iff the algorithm has established that
they cannot simulate each other (i.e., they cannot serve as $s \xrightarrow{a} s'$ and $t' \xrightarrow{a} t''$
in Definition 2).

The partition of states $\Pi_s$ records the current knowledge about branching
bisimilarity. Two states are in different blocks iff the algorithm has found a proof
that they are not branching bisimilar (this is formalised in Invariant 3). This
implies that $\Pi_s$ must be such that states with outgoing transitions in different
combinations of bunches are in different blocks (Invariant 1).

Before performing partition refinement, the LTS is preprocessed to contract
$\tau$-strongly connected components (SCCs) into a single state without a $\tau$-loop.
This step is valid as all states in a $\tau$-SCC are branching bisimilar.
Consequently, every block has bottom states, i.e., states without outgoing inert
$\tau$-transitions [10].
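
The preprocessing step can be sketched as follows. This version assumes Kosaraju's two-pass SCC algorithm restricted to $\tau$-transitions, which is one standard choice and not necessarily what the mCRL2 implementation does; `TAU` and `contract_tau_sccs` are hypothetical names.

```python
TAU = "tau"  # assumed encoding of the internal action


def contract_tau_sccs(states, transitions):
    """Merge every tau-SCC into one representative and drop tau-self-loops."""
    fwd = {s: [] for s in states}
    bwd = {s: [] for s in states}
    for (s, a, t) in transitions:
        if a == TAU:
            fwd[s].append(t)
            bwd[t].append(s)

    # Pass 1: record states in order of DFS finishing time over tau-edges.
    order, seen = [], set()
    for root in states:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: backward searches in reverse finishing order yield the SCCs.
    rep = {}
    for root in reversed(order):
        if root in rep:
            continue
        rep[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in bwd[node]:
                if prev not in rep:
                    rep[prev] = root
                    stack.append(prev)

    new_states = set(rep.values())
    new_transitions = {(rep[s], a, rep[t]) for (s, a, t) in transitions
                       if not (a == TAU and rep[s] == rep[t])}
    return new_states, new_transitions, rep


states = {0, 1, 2, 3}
transitions = {(0, TAU, 1), (1, TAU, 0), (1, "a", 2), (2, TAU, 3)}
new_states, new_transitions, rep = contract_tau_sccs(states, transitions)
```

In the example, states 0 and 1 form a $\tau$-SCC and are merged, while the $\tau$-transition from 2 to 3 is kept because it does not lie on a cycle.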

The core invariant of the algorithm says that if one state in a block can 
inertly reach a transition in a bunch, all states in that block can inertly reach a 
transition in this bunch. This can be formulated in terms of bottom states: 


Invariant 1 (Bunches). $\Pi_s$ is stable under $\Pi_t$, i.e., if a bunch $T \in \Pi_t$ contains
a transition with its source state in a block $B \in \Pi_s$, then every bottom state in
block $B$ has a transition in bunch $T$.
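
Invariant 1 can be checked directly on small examples. The sketch below is a naive checker meant only to illustrate the statement; the paper's algorithm maintains the invariant incrementally instead. The names `bottom_states` and `satisfies_invariant_1` are assumptions made for this illustration.

```python
TAU = "tau"  # assumed encoding of the internal action


def bottom_states(block, transitions):
    """States of `block` without an outgoing tau-transition that stays in `block`."""
    has_inert = {s for (s, a, t) in transitions
                 if a == TAU and s in block and t in block}
    return set(block) - has_inert


def satisfies_invariant_1(blocks, bunches, transitions):
    """True iff every block is stable under every bunch."""
    for block in blocks:
        bottoms = bottom_states(block, transitions)
        for bunch in bunches:
            sources = {s for (s, _, _) in bunch if s in block}
            # If the bunch has a transition leaving this block, every
            # bottom state of the block needs a transition in the bunch.
            if sources and not bottoms <= sources:
                return False
    return True


transitions = {(0, "a", 2), (1, "b", 2)}
blocks = [{0, 1}, {2}]
one_bunch = [set(transitions)]
two_bunches = [{(0, "a", 2)}, {(1, "b", 2)}]
```

With a single bunch the block {0, 1} is stable. Once the bunch is split by label, state 1 has no transition in the first new bunch, so the invariant fails until the block is split as well.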

The initial partitions $\Pi_s$ and $\Pi_t$ are the coarsest partitions that satisfy
Invariant 1. $\Pi_t$ starts with a single bunch consisting of all non-inert transitions. Then,
in $\Pi_s$ we need to separate states with some transition in this bunch from those
without. We define $B_{vis}$ to be the set of states from which a visible transition is
inertly reachable, and $B_{invis}$ to be the other states. Then $\Pi_s = \{B_{vis}, B_{invis}\} \setminus \{\emptyset\}$.

Transitions in a bunch may have different labels or go to different blocks. In
that case, the bunch can be split as these transitions cannot simulate each other.
If we manage to achieve the situation where all transitions in a bunch have the
same label and go to the same target block, the obtained partition turns out
to be a branching bisimulation. Therefore, we want to split each bunch into
so-called action-block-slices defined below. We also immediately define some other
sets derived from $\Pi_t$ and $\Pi_s$ as we require them in our further exposition. So,
we have:

— The action-block-slices, i.e., the transitions in $T$ with label $a$ ending in $B'$:
$T_{\xrightarrow{a} B'} = \{s \xrightarrow{a} s' \in T \mid s' \in B'\}$.
— The block-bunch-slices, i.e., the transitions in $T$ starting in $B$:
$T_{B \to} = \{s \xrightarrow{a} s' \in T \mid s \in B\}$.
— A block-bunch-slice intersected with an action-block-slice:
$T_{B \xrightarrow{a} B'} = T_{B \to} \cap T_{\xrightarrow{a} B'} = \{s \xrightarrow{a} s' \in T \mid s \in B \wedge s' \in B'\}$.
— The bottom states of $B$, i.e., the states without outgoing inert transitions:
$Bottom(B) = \{s \in B \mid \neg\exists s' \in B.\ s \xrightarrow{\tau} s'\}$.
— The states in $B$ with a transition in bunch $T$: $B_{\to T} = \{s \mid s \xrightarrow{a} s' \in T_{B \to}\}$.
— The outgoing transitions of block $B$: $B_{\to} = \{s \xrightarrow{a} s' \mid s \in B, a \in Act, s' \in S\}$.
— The incoming transitions of block $B$: $B_{\leftarrow} = \{s \xrightarrow{a} s' \mid s \in S, a \in Act, s' \in B\}$.

The block-bunch-slices and action-block-slices are explicitly maintained as
auxiliary data structures in the algorithm in order to meet the required performance
bounds. If the partitions $\Pi_s$ or $\Pi_t$ are adapted, all the derived sets above also
change accordingly.
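
The derived sets can be illustrated by computing them from scratch, as in the sketch below. The actual algorithm maintains them incrementally to stay within its time bounds, so this is only meant to pin down the definitions. All function names are assumptions introduced here.

```python
from collections import defaultdict

TAU = "tau"  # assumed encoding of the internal action


def block_index(blocks):
    """Map every state to the index of its block."""
    return {s: i for i, block in enumerate(blocks) for s in block}


def action_block_slices(bunch, blocks):
    """Group the transitions of a bunch by (label, index of target block)."""
    index = block_index(blocks)
    slices = defaultdict(set)
    for (s, a, t) in bunch:
        slices[(a, index[t])].add((s, a, t))
    return dict(slices)


def block_bunch_slice(bunch, block):
    """The transitions of the bunch that start in the block."""
    return {(s, a, t) for (s, a, t) in bunch if s in block}


def bottom_states(block, transitions):
    """States of the block without an outgoing inert transition."""
    has_inert = {s for (s, a, t) in transitions
                 if a == TAU and s in block and t in block}
    return set(block) - has_inert


def is_trivial(bunch, blocks):
    """A bunch is trivial if it consists of a single action-block-slice."""
    return len(action_block_slices(bunch, blocks)) <= 1


transitions = {(0, TAU, 1), (1, "a", 2), (2, TAU, 3), (1, TAU, 4)}
blocks = [{0, 1}, {2, 3, 4}]
bunch = {(1, "a", 2), (1, TAU, 4)}
slices = action_block_slices(bunch, blocks)
```

In this example the bunch contains two action-block-slices, one for label `a` and one for the non-inert $\tau$-transition, so it is non-trivial and would be selected for splitting.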

A bunch can be trivial, which means that it only contains one action-block- 
slice, or it can contain multiple action-block-slices. In the latter case one action- 
block-slice is split off to become a bunch by itself. However, this may invalidate 
Invariant 1. Some states in a block may only have transitions in the new bunch 
while other states have only transitions in the old bunch. Therefore, blocks have 
to be split to satisfy Invariant 1. Splitting blocks can cause bunches to become 
non-trivial because action-block-slices fall apart. 

This splitting is repeated until all bunches are trivial, and as already stated
above, the obtained partition $\Pi_s$ is the required branching bisimulation. As the
transition system is finite this process of repeated splitting terminates.
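
For comparison, and as a possible reference implementation when testing, branching bisimilarity can also be computed by a much simpler but slower method. The sketch below uses signature-based partition refinement in the style of Blom and Orzan, which is a different technique from the algorithm of this paper and does not achieve O(m log n). The function name `branching_bisim_naive` is hypothetical.

```python
TAU = "tau"  # assumed encoding of the internal action


def branching_bisim_naive(states, transitions):
    """Return the branching bisimilarity classes, sorted by smallest element."""
    succ = {s: [] for s in states}
    for (s, a, t) in transitions:
        succ[s].append((a, t))
    block = {s: 0 for s in states}  # start with one block containing all states

    while True:
        def signature(s):
            # Pairs (a, target block) of non-inert transitions that s can
            # reach via tau-steps staying inside the block of s.
            seen, stack, sig = {s}, [s], set()
            while stack:
                u = stack.pop()
                for (a, v) in succ[u]:
                    if a == TAU and block[v] == block[s]:
                        if v not in seen:
                            seen.add(v)
                            stack.append(v)
                    else:
                        sig.add((a, block[v]))
            return frozenset(sig)

        keys = {s: (block[s], signature(s)) for s in states}
        ids = {}
        new_block = {s: ids.setdefault(keys[s], len(ids)) for s in states}
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block

    classes = {}
    for s, b in block.items():
        classes.setdefault(b, set()).add(s)
    return sorted(classes.values(), key=min)


states = set(range(8))
transitions = {(0, TAU, 1), (1, "a", 2), (3, "a", 4),
               (5, TAU, 6), (5, "b", 7), (6, "a", 7)}
classes = branching_bisim_naive(states, transitions)
```

Each round recomputes all signatures, so the running time can be far worse than that of the algorithm described here. The value of such a sketch lies in its simplicity.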


3.2 Abstract algorithm 


We first present an abstract version of the algorithm in Algorithm 1. Its
behaviour is as follows. As long as there are non-trivial bunches (i.e., bunches
containing multiple action-block-slices), these bunches need to be split such
that they ultimately become trivial. The outer loop (Lines 1.2-1.19) takes a

Algorithm 1 Abstract algorithm for branching bisimulation partitioning

1.1: Contract $\tau$-SCCs; initialize $\Pi_s$ and $\Pi_t$
1.2: for all non-trivial bunches $T \in \Pi_t$ do
1.3:   Select an action-block-slice $T_{\xrightarrow{a} B'} \subset T$
1.4:   Split $T$ into $T_{\xrightarrow{a} B'}$ and $T \setminus T_{\xrightarrow{a} B'}$
1.5:   for all unstable blocks $B \in \Pi_s$ (i.e., $\emptyset \neq T_{B \xrightarrow{a} B'} \neq T_{B \to}$) do
1.6:     First make $T_{B \xrightarrow{a} B'}$ a primary splitter; then make $T_{B \to} \setminus T_{B \xrightarrow{a} B'}$ a secondary splitter
1.7:   end for
1.8:   for all splitters $T'_{B \to}$ (in order) do
1.9:     Split $B$ into the subblock $R$ that can inertly reach $T'_{B \to}$ and the rest $U$
1.10:    if $T'_{B \to}$ was a primary splitter (note: $T'_{B \to} = T_{B \xrightarrow{a} B'}$) then
1.11:      Make $T_{U \to} \setminus T_{U \xrightarrow{a} B'}$ a non-splitter
1.12:    end if
1.13:    if there are new non-inert transitions $R \xrightarrow{\tau} U$ then
1.14:      Split $R$ into the subblock $N$ that can inertly reach $R \xrightarrow{\tau} U$ and the rest $R'$
1.15:      Make all block-bunch-slices $T_{N \to}$ of $N$ secondary splitters
1.16:      Create a bunch for the new non-inert transitions $(N \xrightarrow{\tau} U) \cup (N \xrightarrow{\tau} R')$
1.17:    end if
1.18:  end for
1.19: end for
1.20: return $\Pi_s$


non-trivial bunch $T$ from $\Pi_t$ and from this it moves an action-block-slice $T_{\xrightarrow{a} B'}$
into its own bunch in $\Pi_t$ (Line 1.4). Hence, bunch $T$ is reduced to $T \setminus T_{\xrightarrow{a} B'}$.

The two new bunches $T_{\xrightarrow{a} B'}$ and $T \setminus T_{\xrightarrow{a} B'}$ can cause instability, violating
Invariant 1. This means there can be blocks with transitions in one new bunch,
but some bottom states only have transitions in the other new bunch. For such
blocks, stability needs to be restored by splitting them.

To restore this stability we investigate all block-bunch-slices in one of the
new bunches, namely $T_{\xrightarrow{a} B'}$. Blocks that do not have transitions in these
block-bunch-slices are stable with respect to both bunches. To keep track of the blocks
that still need to be split, we partition the block-bunch-slices $T_{B \to}$ into stable
and unstable block-bunch-slices. A block-bunch-slice is stable if we have ensured
that it is not a splitter for any block. Otherwise it is deemed unstable, and it
needs to be checked whether it is stable, or whether the block $B$ must be split.
The first inner loop (Lines 1.5-1.7) inserts all unstable block-bunch-slices into
the splitter list. Block-bunch-slices of the shape $T_{B \xrightarrow{a} B'}$ in the splitter list are
labelled primary, and other list entries are labelled secondary.

In the second loop (Lines 1.8-1.18), one splitter $T'_{B \to}$ from the splitter list is
taken at a time and its source block is split into $R$ (the part that can inertly reach
$T'_{B \to}$) and $U$ (the part that cannot inertly reach $T'_{B \to}$) to re-establish stability.

If $T'_{B \to}$ was a primary splitter of the form $T_{B \xrightarrow{a} B'}$, then we know that $U$
must be stable under $T_{U \to} \setminus T_{U \xrightarrow{a} B'}$, as every bottom state in $B$ has a transition
in the former block-bunch-slice $T_{B \to}$, and as the states in $U$ have no transition
in $T_{B \xrightarrow{a} B'}$, every bottom state in $U$ must have a transition in $T_{B \to} \setminus T_{B \xrightarrow{a} B'}$.
Therefore, at Line 1.11, block-bunch-slice $T_{U \to} \setminus T_{U \xrightarrow{a} B'}$ can be removed from
the splitter list. This is the three-way split from [18].

Some inert transitions may have become non-inert, namely the $\tau$-transitions
that go from $R$ to $U$. There cannot be $\tau$-transitions from $U$ to $R$. The new
non-inert transitions were not yet part of a bunch in $\Pi_t$. So, a new bunch $R \xrightarrow{\tau} U$ is
formed for them. All transitions in this new bunch leave $R$ and thus $R$ is the only
block that may not be stable under this new bunch. To avoid superfluous work,
we split off the unstable part $N$, i.e. the part that can inertly reach a transition
in $R \xrightarrow{\tau} U$ and contains all new bottom states, at Line 1.14. The original bottom
states of $R$ become the bottom states of $R'$. There can be transitions $N \xrightarrow{\tau} R'$
that also become non-inert, and we add these to the new bunch $R \xrightarrow{\tau} U$. As
observed in [10], blocks containing new bottom states can become unstable under
any bunch. So, stability of $N$ (but not of $R'$) must be re-established, and all
block-bunch-slices leaving $N$ are put on the splitter list at Line 1.15.
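
The effect of a single split can be sketched as follows. This naive version walks the whole block and therefore does not meet the time bounds of the actual splitting procedure, which the paper presents separately as a pair of coroutines. The name `split` is taken from the pseudocode, but its signature and return value here are assumptions.

```python
TAU = "tau"  # assumed encoding of the internal action


def split(block, splitter, transitions):
    """Split `block` into R (can inertly reach `splitter`) and U (the rest).

    Also returns the tau-transitions from R to U, which were inert
    before the split and become non-inert afterwards.
    """
    inert_pred = {s: [] for s in block}
    for (s, a, t) in transitions:
        if a == TAU and s in block and t in block:
            inert_pred[t].append(s)
    # Start from the sources of splitter transitions and search backwards
    # along inert transitions.
    red = {s for (s, _, _) in splitter if s in block}
    stack = list(red)
    while stack:
        t = stack.pop()
        for s in inert_pred[t]:
            if s not in red:
                red.add(s)
                stack.append(s)
    rest = set(block) - red
    new_non_inert = {(s, a, t) for (s, a, t) in transitions
                     if a == TAU and s in red and t in rest}
    return red, rest, new_non_inert


transitions = {(0, TAU, 1), (0, TAU, 2), (1, "a", 3), (2, "b", 3)}
R, U, new_non_inert = split({0, 1, 2}, {(1, "a", 3)}, transitions)
```

In the example, state 0 ends up in R because it can inertly reach the `a`-transition, and its $\tau$-transition to state 2 becomes non-inert. State 0 was not a bottom state before the split but is one afterwards, which is the situation that Lines 1.13-1.17 handle.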


3.3  Correctness 


The validity of the algorithm follows from a number of major invariants. The 
main invariant, Invariant 1, is valid at Line 1.2. Additionally, the algorithm 
satisfies the following three invariants. 


Invariant 2 (Bunches are not unnecessarily split). For any pair of
non-inert transitions $s \xrightarrow{a} s'$ and $t \xrightarrow{a} t'$, if $s, t \in B$ and $s', t' \in B'$ then $s \xrightarrow{a} s' \in T$
and $t \xrightarrow{a} t' \in T$ for some bunch $T \in \Pi_t$.


Invariant 3 (Preservation of branching bisimilarity). For all states $s, t \in S$,
if $s \mathrel{\underline{\leftrightarrow}_b} t$, then there is some block $B \in \Pi_s$ such that $s, t \in B$.


Invariant 4 (No inert loops). There is no inert loop in a block, i.e., for every
sequence $s_1 \xrightarrow{\tau} s_2 \xrightarrow{\tau} \cdots \xrightarrow{\tau} s_n$ with $s_i \in B \in \Pi_s$, $n > 1$ it holds that $s_1 \neq s_n$.


Invariant 2 indicates that two non-inert transitions that (1) start in the same 
block, (2) have the same label, and (3) end in the same block, always reside in 
the same bunch. Invariant 3 says that branching bisimilar states never end up in 
separate blocks. Invariant 4 ensures that all $\tau$-paths in each block are finite. As
a consequence every block has at least one bottom state, and from every state a 
bottom state can be inertly reached. 

The invariants given above allow us to prove that the algorithm works
correctly. When the algorithm terminates (and this always happens, see Section 3.5),
branching bisimilar states are perfectly grouped in blocks.


Theorem 1. From the Invariants 1, 3 and 4, it follows that after the algorithm
terminates, ${\equiv_{\Pi_s}} = {\underline{\leftrightarrow}_b}$.


Because of the space restrictions here, the proofs are omitted. The interested 
reader is referred to [14] for the details. 


3.4 In-depth description of the algorithm 


To show that the algorithm has the desired O(m log n) time complexity, we now 
give a more detailed description of the algorithm. The pseudocode of the detailed 
algorithm is given in Algorithm 2. This algorithm serves two purposes. First of 
all, it clarifies how the data structures are used, and refines many of the steps in 
the high-level algorithm. Additionally, time budgets for parts of the algorithm 

Algorithm 2 Detailed algorithm for branching bisimulation partitioning

2.1: Find $\tau$-SCCs and contract each of them to a single state
2.2: $B_{vis} := \{s \in S \mid s$ can inertly reach some $s' \xrightarrow{a} s''$ with $a \neq \tau\}$; $B_{invis} := S \setminus B_{vis}$        O(m)
2.3: $\Pi_s := \{B_{vis}, B_{invis}\} \setminus \{\emptyset\}$
2.4: $\Pi_t := \{\{s \xrightarrow{a} s' \mid a \in Act \setminus \{\tau\}, s, s' \in S\} \cup B_{vis} \xrightarrow{\tau} B_{invis}\}$
2.5: for all non-trivial bunches $T \in \Pi_t$ do        $\le m$ iterations
2.6:   Select $a \in Act$ and $B' \in \Pi_s$ with $|T_{\xrightarrow{a} B'}| \le \frac{1}{2} |T|$
2.7:   $\Pi_t := (\Pi_t \setminus \{T\}) \cup \{T_{\xrightarrow{a} B'}, T \setminus T_{\xrightarrow{a} B'}\}$
2.8:   for all unstable blocks $B \in \Pi_s$ with $\emptyset \subset T_{B \xrightarrow{a} B'} \subset T_{B \to}$ do
2.9:     Append $T_{B \xrightarrow{a} B'}$ as primary to the splitter list
2.10:    Append $T_{B \to} \setminus T_{B \xrightarrow{a} B'}$ as secondary to the splitter list
2.11:    Mark all transitions in $T_{B \xrightarrow{a} B'}$
2.12:    For every state $\in B$ with both marked outgoing transitions
         and outgoing transitions in $T_{B \to} \setminus T_{B \xrightarrow{a} B'}$, mark one such
         transition
2.13:  end for
2.14:  for all splitters $T'_{B \to}$ in the splitter list (in order) do        $\le |T_{\xrightarrow{a} B'}|$ iterations
2.15:    $(R, U)$ := split($B$, $T'_{B \to}$)
2.16:    Remove $T'_{B \to} = T'_{R \to}$ from the splitter list
2.17:    $\Pi_s := (\Pi_s \setminus \{B\}) \cup (\{R, U\} \setminus \{\emptyset\})$
2.18:    if $T'_{B \to}$ was a primary splitter (note: $T'_{B \to} = T_{B \xrightarrow{a} B'}$) then
2.19:      Remove $T_{U \to} \setminus T_{U \xrightarrow{a} B'}$ from the splitter list
2.20:    end if
2.21:    if $R \xrightarrow{\tau} U \neq \emptyset$ then
2.22:      Create a new bunch containing exactly $R \xrightarrow{\tau} U$, add
           $R \xrightarrow{\tau} U = (R \xrightarrow{\tau} U)_{R \to}$ to the splitter list, and mark
           all its transitions
2.23:      $(N, R')$ := split($R$, $R \xrightarrow{\tau} U$)
2.24:      Remove $R \xrightarrow{\tau} U = (R \xrightarrow{\tau} U)_{N \to}$ from the splitter list
2.25:      $\Pi_s := (\Pi_s \setminus \{R\}) \cup (\{N, R'\} \setminus \{\emptyset\})$
2.26:      Add $N \xrightarrow{\tau} R'$ to the bunch containing $R \xrightarrow{\tau} U$
2.27:      Insert all $T_{N \to}$ as secondary into the splitter list
2.28:      For each bottom state $\in N$, mark one of its outgoing
           transitions in every $T_{N \to}$ where it has one
2.29:    end if
2.30:  end for
2.31: end for
2.32: return $\Pi_s$


are printed in grey at the right-hand side of the pseudocode. We use these time 
budgets in Section 3.5 to analyse the overall complexity of the algorithm. We 
focus on the most important details in the algorithm. 

At Lines 2.6-2.7, a small action-block-slice $T_{\xrightarrow{a} B'}$ is moved into its own bunch,
and $T$ is reduced to $T \setminus T_{\xrightarrow{a} B'}$. All blocks that have transitions in the two new
bunches are added to the splitter list in Lines 2.8-2.13. This loop also marks
some transitions (in the time complexity annotations we write $Marked(T_{B\rightarrow})$ for
the marked transitions of block-bunch-slice $T_{B\rightarrow}$). The function of this marking
is similar to that of the counters in [18]: it serves to determine quickly whether a
bottom state has a transition in a secondary splitter $T_{B\rightarrow} \setminus T_{B \xrightarrow{a} B'}$ (or slices that
are the result of splitting this slice). In general, a bottom state has transitions
in some splitter block-bunch-slice if and only if it has marked transitions in
this slice. There is one exception: After splitting under a primary splitter $T_{B \xrightarrow{a} B'}$,
bottom states in $U$ are not marked. But as they always have a transition in
$T_{U\rightarrow} \setminus T_{U \xrightarrow{a} B'}$, $U$ is already stable in this case (see Line 2.19).


The second loop is refined to Lines 2.14-2.30. In every iteration one splitter
$T'_{B\rightarrow}$ from the splitter list is considered, and its source block is first split into $R$
and $U$. Formally, the routine split$(B, T)$ delivers the pair $(R, U)$ defined by:


$$R = \{ s \in B \mid s \xrightarrow{\tau} s_1 \xrightarrow{\tau} \cdots \xrightarrow{\tau} s_n \xrightarrow{\alpha} s' \text{ where } s_1, \ldots, s_n \in B,\ s_n \xrightarrow{\alpha} s' \in T \}, \qquad U = B \setminus R. \tag{1}$$


We detail its algorithm and discuss its correctness in Section 4. 
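
To make Equation (1) concrete, the following is a minimal Python sketch that computes $(R, U)$ by a plain backward search along inert transitions. It is not the efficient routine of Section 4: it ignores markings, coroutines and time budgets. The function name `naive_split` and the encoding of transitions as tuples are assumptions made for illustration only, and a state that is itself the source of a transition in $T$ is taken to be in $R$ (zero inert steps).

```python
from collections import defaultdict

def naive_split(block, inert, splitter):
    """Return (R, U) for a block B as in Equation (1), by backward search.

    block    -- set of states B
    inert    -- iterable of (s, t) pairs: inert transitions s -tau-> t
    splitter -- iterable of (s, label, t) triples: the transitions in T
    """
    # Predecessors along inert transitions that stay inside the block.
    preds = defaultdict(set)
    for s, t in inert:
        if s in block and t in block:
            preds[t].add(s)
    # States of B that are the source of a transition in T.
    R = {s for (s, _label, _t) in splitter if s in block}
    todo = list(R)
    while todo:
        t = todo.pop()
        for s in preds[t]:
            if s not in R:          # s can inertly reach a transition in T
                R.add(s)
                todo.append(s)
    return R, set(block) - R
```

For example, with `block = {0, 1, 2, 3}`, inert transitions `0 -> 1 -> 2` and a splitter containing only a transition from state 2, states 0, 1 and 2 end up in R and state 3 in U.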

In Lines 2.21-2.28, the situation is handled when some inert transitions have
become non-inert. We mark one of the outgoing transitions of every new bottom
state such that we can find the bottom states with a transition in $T_{N\rightarrow}$ in time
proportional to the number of such new bottom states.

We illustrate the algorithm in the following example. Note this also illustrates 
some of the details of the split subroutine, which is discussed in detail in Section 4. 


Example 1. Consider the situation in Figure 1a. Observe that block $B$ is stable
w.r.t. the bunches $T$ and $T'$. We have split off a small bunch $T_{\xrightarrow{a} B'}$ from $T$, and
as a consequence, $B$ needs to be restabilised. The bunches put on the splitter list
initially are $T_{\xrightarrow{a} B'}$ and $T \setminus T_{\xrightarrow{a} B'}$. When putting these bunches on the splitter
list, all transitions in $T_{\xrightarrow{a} B'}$ are marked, see the m's in Figure 1b. Also, for
states that have transitions both in $T_{\xrightarrow{a} B'}$ and in $T \setminus T_{\xrightarrow{a} B'}$, one transition in the
latter bunch is marked, see the m's in Figure 1b.

We now first split $B$ w.r.t. the primary splitter $T_{\xrightarrow{a} B'}$ into $R$, the states that
can inertly reach $T_{\xrightarrow{a} B'}$, and $U$, the states that cannot. In Figure 1b, the states
known to be destined for $R$ and the states known to be destined for $U$ are
indicated by distinct markers. Initially, all states with a marked outgoing transition
are destined for $R$, the remaining bottom state of $B$ is destined for $U$. The split
subroutine proceeds to extend sets $R$ and $U$ in a backwards fashion using two
coroutines, marking a state destined for $R$ if one of its successors is already in $R$,
and marking a state destined for $U$ if all its successors are in $U$. Here, the state
in $U$ does not have any incoming inert transitions, so its coroutine immediately
terminates and all other states belong to $R$. Block $B$ is split into subblocks $R$
and $U$, as shown in Figure 1c. Block $U$ is stable w.r.t. both $T_{\xrightarrow{a} B'}$ and $T \setminus T_{\xrightarrow{a} B'}$.

We still need to split $R$ w.r.t. $T \setminus T_{\xrightarrow{a} B'}$ into $R_1$ and $U_1$, say. For this, we
use the marked transitions in $T \setminus T_{\xrightarrow{a} B'}$ as a starting point to compute all
bottom states that can reach a transition in $T \setminus T_{\xrightarrow{a} B'}$. This guarantees that the
time we use is proportional to the size of $T_{\xrightarrow{a} B'}$. Initially, there is one state
destined for $R_1$ and one state destined for $U_1$, each marked accordingly
in Figure 1c. We now perform the two coroutines in split simultaneously.
Figure 1d shows the situation after both coroutines have considered one
transition: The $U_1$-coroutine (which calculates the states that cannot inertly reach
$T \setminus T_{\xrightarrow{a} B'}$) has initialised the counter untested of one state to 2 on Line 3.9ℓ
of Algorithm 3 because two of its outgoing inert transitions have not yet been
considered. The $R_1$-coroutine (which calculates the states that can inertly reach
$T \setminus T_{\xrightarrow{a} B'}$) has checked the unmarked transition in the splitter $T_{R\rightarrow} \setminus T_{R \xrightarrow{a} B'}$.
As the latter coroutine has finished visiting unmarked transitions in the splitter,
the $U_1$-coroutine no longer needs to run the slow test loop at Lines 3.13ℓ-3.17ℓ
of the left column of Algorithm 3. In Figure 1e the situation is shown after two
more steps in the coroutines. Each has visited two extra transitions. There are two
extra states destined for $R_1$, and one state is destined for $U_1$ with
0 remaining inert transitions, for which we know immediately that it has no
transition in $T \setminus T_{\xrightarrow{a} B'}$. Now, the $R_1$-coroutine is terminated,
since it contains more than $\frac{1}{2}|R|$ states, and the remaining incoming transitions
of states in $U_1$ are visited. This will not further extend $U_1$. The result of splitting
is shown in Figure 1f. Some inert transitions become non-inert, so a new bunch
with transitions $R_1 \xrightarrow{\tau} U_1$ is created, and all these transitions are marked m.

Fig. 1: Illustration of splitting of a small block from $T$ and stabilising block $B$
with respect to the new bunches $T_{\xrightarrow{a} B'}$ and $T \setminus T_{\xrightarrow{a} B'}$, as explained in Example 1.

We next have to split $R_1$ with respect to this new bunch into the set of
states $N_1$ that can inertly reach a transition in the new bunch, and the set
$R'_1$ that cannot inertly reach this bunch. In this case, all states in $R_1$ have a
marked outgoing transition, hence $N_1 = R_1$, and $R'_1 = \emptyset$. The coroutine that
calculates the set of states that cannot inertly reach a transition in the bunch
will immediately terminate because there are no transitions to be considered.

Observe that $R_1$ ($= N_1$) has a new bottom state, marked 'nb'. This means
that stability of $R_1$ with respect to any bunch is not guaranteed any more and
needs to be re-established. We therefore consider all bunches in which $R_1$ has an
outgoing transition. We add $T_{R_1 \xrightarrow{a} B'}$, $T_{R_1\rightarrow} \setminus T_{R_1 \xrightarrow{a} B'}$ and $T'_{R_1\rightarrow}$ to the splitter
list as secondary splitters, and mark one outgoing transition from each bottom
state in each of these bunches using m. This situation is shown in Figure 1g.

In this case, $R_1$ is stable w.r.t. $T_{R_1 \xrightarrow{a} B'}$ and $T_{R_1\rightarrow} \setminus T_{R_1 \xrightarrow{a} B'}$, i.e., all states in
$R_1$ can inertly reach a transition in both bunches. In both cases this is observed
immediately after initialisation in split, since the set of states that cannot inertly
reach a transition in these bunches is initially empty, and the corresponding
coroutine terminates immediately.

Therefore, consider splitting $R_1$ with respect to $T'_{R_1\rightarrow}$. This leads to $R_2$, the
set of states that can inertly reach a transition in $T'$, and $U_2$, the set of states
that cannot inertly reach a transition in $T'$. Note there are no marked transitions
in $T'_{R_1\rightarrow}$, so initially all bottom states of $R_1$ are destined for $U_2$ (as marked in
Figure 1h), and there are no states destined for $R_2$. Then we start splitting $R_1$. In
the $R_2$-coroutine, we first add the states with an unmarked transition in $T'_{R_1\rightarrow}$ to
$R_2$ at Line 3.4r (i.e., in the right column of Algorithm 3) and then all predecessors
of the new bottom state need to be considered. When split terminates, there will
be no additional states in $U_2$, and the remaining states end up in $R_2$.

The situation after splitting $R_1$ into $R_2$ and $U_2$ is shown in Figure 1i. One of
the inert transitions (marked m) becomes non-inert. Furthermore, $R_2$ contains a
new bottom state. This is the state with a transition in $T'$. As each block must
have a bottom state, a non-bottom state had to become a bottom state.

We need to continue stabilising $R_2$ w.r.t. bunch $R_2 \xrightarrow{\tau} U_2$, which does not
lead to a new split, and we need to restabilise $R_2$ w.r.t. all bunches in which it
has an outgoing transition. This also does not lead to new splits, so the situation
in Figure 1i after removing the markings is the final result of splitting.


3.5 Time complexity 


Throughout this section, let n be the number of states and m the number of
transitions in the LTS. To simplify the complexity notations we assume that
$n \le m + 1$. This is not a significant restriction, since it is satisfied by any LTS
in which every non-initial state has an incoming transition. We also write in(s)
and out(s) for the sets of incoming and outgoing transitions of state s.

We use the principle “Process the smaller half” [13]: when a set is split into 
two parts, we spend time proportional to the size of the smaller subset. This leads 
to a logarithmic number of operations assigned to each element. We apply this 
principle twice, once to new bunches and once to new subblocks. Additionally, 
we spend some time on new bottom states. This is formulated in the following 
theorem. 


Theorem 2. For the main loop of Algorithm 2 we have:


1. A transition is moved to a new small bunch at most $\lfloor \log_2 n^2 \rfloor + 1$ times.
Whenever this happens, constant time is spent on this transition.

2. A state s is moved to a new small subblock at most $\lfloor \log_2 n \rfloor$ times. Whenever
this happens, O(|in(s)| + |out(s)| + 1) time is spent on state s.

3. A state s becomes a new bottom state at most once. When this happens,
O(|out(s)| + 1) time is spent on state s.


Summing up these time budgets leads to an overall time complexity of O(m log n).
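
The logarithmic bound in item 2 rests only on the fact that the smaller part of a split has at most half the size of the set being split. The following Python sketch checks this empirically on random splits of a plain set; it is an illustration of the "process the smaller half" principle, not part of the algorithm, and the function name `max_smaller_half_moves` is an assumption made here.

```python
import math
import random

def max_smaller_half_moves(n, seed=0):
    """Split {0..n-1} at random until only singletons remain and return the
    largest number of times any element landed in the smaller part."""
    rng = random.Random(seed)
    moves = [0] * n
    parts = [list(range(n))]
    while parts:
        part = parts.pop()
        if len(part) < 2:
            continue
        rng.shuffle(part)
        cut = rng.randint(1, len(part) - 1)      # both parts are non-empty
        left, right = part[:cut], part[cut:]
        small = left if len(left) <= len(right) else right
        for x in small:
            moves[x] += 1                        # work charged to the smaller part
        parts.extend([left, right])
    return max(moves)

if __name__ == "__main__":
    for n in (2, 7, 64, 1000):
        print(n, max_smaller_half_moves(n), math.floor(math.log2(n)))
```

Each time an element is charged, the part containing it shrinks to at most half its previous size, so the first number printed after n never exceeds the second.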


These runtimes are annotated as time budgets in the main loop of Algorithm 2.
Line 2.7 moves the transitions of $T_{\xrightarrow{a} B'}$ to their new bunch, and
Lines 2.6-2.14 take time proportional to the size of this new bunch.

A new subblock is formed at Line 2.17 (and at the same time, some states 
in subblock R may become new bottom states). Lines 2.15-2.22 take time pro- 
portional to its incoming and outgoing transitions. Similarly, a new subblock 
is formed in Line 2.23, and Lines 2.23-2.26 take time proportional to this sub- 
block's transitions. 

Finally, new bottom states found in $R$ (and separated into $N$) allow us to spend
time proportional to $|Bottom(N)_\rightarrow|$ at Lines 2.15-2.28. At Line 2.27 we need to
include not only the current new bottom states but also the future ones because
there may be block-bunch-slices that only have transitions from non-bottom
states. When $N$ is split under such a block-bunch-slice, at least one of these
states will become a bottom state.

Time spent per marked transition fits the time bound because only a small
number of transitions is marked: In Lines 2.11 and 2.12, at most two transitions
are marked per transition in the small splitter $T_{\xrightarrow{a} B'}$. Line 2.22 marks $R \xrightarrow{\tau} U \subseteq$
out($R$) $\cap$ in($U$), which is always within the transitions of the smaller subblock.
Line 2.28 marks no more transitions than the new bottom states have.

The initialisation in Lines 2.1-2.5 can be performed in O(m) time, where 
the assumption $n \le m + 1$ is used. Furthermore, we assume that we can access
action labels fast enough to bucket sort the transitions in time O(m), which is 
for instance the case if the action labels are consecutively numbered. 

To meet the indicated time budgets, our implementation uses a number of 
data structures. States are stored in a refinable partition [21], grouped per block, 
in such a way that we can visit bottom states without spending time on non- 
bottom states. Transitions are stored in four linked refinable partitions, grouped 
per source state, per target state, per bunch, and per block-bunch-slice, in such
a way that we can visit marked transitions without spending time on unmarked 
transitions of the block. How these data structures are instrumental for the 
complexity can be found in [14]. 
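
As an illustration of the kind of data structure meant here, the following Python sketch shows a basic refinable partition in which the elements of each block are stored contiguously and marked elements are kept in a prefix of their block. It is a simplified sketch under assumptions made for this example (class and method names are hypothetical, and splitting always relabels the marked part rather than the smaller part); the actual implementation is described in [14].

```python
class RefinablePartition:
    """Partition of {0..n-1}; elements of a block are contiguous in `elems`."""

    def __init__(self, n):
        self.elems = list(range(n))   # elements, grouped per block
        self.loc = list(range(n))     # loc[e] = index of e in elems
        self.block = [0] * n          # block[e] = id of the block containing e
        self.first = [0]              # first[b] = start index of block b
        self.end = [n]                # end[b] = one past the last index of b
        self.marked = [0]             # marked[b] = number of marked elements

    def mark(self, e):
        """Move e into the marked prefix of its block in constant time."""
        b = self.block[e]
        i, j = self.loc[e], self.first[b] + self.marked[b]
        if i < j:
            return                    # e is already marked
        other = self.elems[j]
        self.elems[i], self.elems[j] = other, e
        self.loc[other], self.loc[e] = i, j
        self.marked[b] += 1

    def marked_elements(self, b):
        """Visit marked elements only; cost proportional to their number."""
        start = self.first[b]
        return self.elems[start:start + self.marked[b]]

    def split(self, b):
        """Make the marked prefix of b a new block; return its id, or None
        if the marked part is empty or is the whole block."""
        k = self.marked[b]
        self.marked[b] = 0
        if k == 0 or k == self.end[b] - self.first[b]:
            return None
        new = len(self.first)
        start = self.first[b]
        self.first.append(start)
        self.end.append(start + k)
        self.marked.append(0)
        self.first[b] = start + k
        for e in self.elems[start:start + k]:
            self.block[e] = new       # relabel only the split-off part
        return new
```

Keeping bottom states (or marked transitions) in such a prefix is what allows them to be visited without touching the other elements of the block.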


4 Splitting blocks 


The function split($B$, $T$), presented in Algorithm 3, refines block $B$ into subblocks
$R$ and $U$, where $R$ contains those states in $B$ that can inertly reach a transition in
$T$, and $U$ contains the states that cannot, as formally specified in Equation (1).

These two sets are computed by two coroutines executing in lockstep: the two 
coroutines start the same number of loop iterations, so that the overhead is at 
most proportional to the faster of the two and all work done in both coroutines 
can be attributed to the smaller of the two subblocks R and U. 
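
The lockstep execution can be sketched with Python generators: each coroutine yields once per unit of work, and a small driver alternates between them. This is a minimal sketch under assumptions made for this example (the names `run_in_lockstep` and `collect` are hypothetical, and a coroutine signals that it aborted itself by returning `None`); it illustrates the scheduling idea only, not Algorithm 3 itself.

```python
def run_in_lockstep(left, right):
    """Alternate one step of each generator and return (name, result) of the
    first one that finishes with a result. A generator that returns None is
    treated as having aborted itself; the other one then runs on alone."""
    active = {"left": left, "right": right}
    while active:
        for name in list(active):
            try:
                next(active[name])
            except StopIteration as stop:
                if stop.value is not None:
                    for other in active.values():
                        other.close()      # abort the other coroutine
                    return name, stop.value
                del active[name]
    return None


def collect(items, limit):
    """Toy coroutine: gather items one per step; abort when more than
    `limit` items have been found."""
    found = []
    for x in items:
        found.append(x)
        if len(found) > limit:
            return None
        yield
    return found
```

For instance, `run_in_lockstep(collect([1, 2], 5), collect(range(100), 50))` returns `("left", [1, 2])` after the second coroutine has performed only two steps, so the total work is proportional to the work of the coroutine that finished.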

As a precondition, split requires that bottom states of $B$ with an outgoing
transition in $T_{B\rightarrow}$ have a marked outgoing transition in $T_{B\rightarrow}$. Formally,
$Bottom(B)_{\xrightarrow{Marked(T)}} = Bottom(B)_{\xrightarrow{T}}$. This allows us to compute the initial
sets: All states in $B_{\xrightarrow{Marked(T)}}$, i.e., sources of marked transitions in $T$, are put in
$R$. All bottom states that are not initially in $R$ are put in $U$.

The sets are extended as follows in the coroutines. For $R$, first the states
in $B_{\xrightarrow{T \setminus Marked(T)}}$ are added that were not yet in $R$. These are all the sources of
unmarked transitions in $T$. Using backward reachability along inert transitions,
$R$ is extended until no more states can be added.


Algorithm 3 Refinement of a block under a splitter

3.1: function split(block $B$, block-bunch-slice $T$)
3.2:   $R := B_{\xrightarrow{Marked(T)}}$; $U := Bottom(B) \setminus R$    [$O(|Marked(T)|)$]
3.3:   begin coroutines

Left column ($U$-coroutine, line numbers suffixed ℓ in the text):

3.4:   Set untested[t] to undefined for all $t \in B$
3.5:   for all $s \in U$ while $|U| \le \frac{1}{2}|B|$ do
3.6:     for all inert $t \xrightarrow{\tau} s$ do
3.7:       if $t \in R$ then Skip t, i.e. goto 3.6ℓ
3.8:       if untested[t] is undefined then
3.9:         untested[t] := $|\{ t \xrightarrow{\tau} u \mid u \in B \}|$
3.10:      end if
3.11:      untested[t] := untested[t] $- 1$
3.12:      if untested[t] $> 0$ then Skip t
3.13:      if $B_{\xrightarrow{T}} \not\subseteq R$ then
3.14:        for all non-inert $t \xrightarrow{\alpha} u$ do
3.15:          if $t \xrightarrow{\alpha} u \in T$ then Skip t
3.16:        end for
3.17:      end if
3.18:      Add t to $U$
3.19:    end for
3.20:  end for
3.21:  if $|U| > \frac{1}{2}|B|$ then
3.22:    Abort this coroutine
3.23:  end if
3.24:  Abort the other coroutine
3.25:  return $(B \setminus U, U)$

Right column ($R$-coroutine, line numbers suffixed r in the text):

3.4:   $R := R \cup B_{\xrightarrow{T \setminus Marked(T)}}$
3.5:   for all $s \in R$ while $|R| \le \frac{1}{2}|B|$ do
3.7:     for all inert $t \xrightarrow{\tau} s$ do
3.18:      Add t to $R$
3.19:    end for
3.20:  end for
3.21:  if $|R| > \frac{1}{2}|B|$ then
3.22:    Abort this coroutine
3.23:  end if
3.24:  Abort the other coroutine
3.25:  return $(R, B \setminus R)$

3.26:  end coroutines
To identify the states in $U$, observe that a state is in $U$ if all its inert successors
are in $U$ and it does not have a transition in $T_{B\rightarrow}$. To compute $U$, we let a
counter untested[t] for every non-bottom state t record the number of outgoing
inert transitions to states that are not yet known to be in $U$. If untested[t] = 0,
this means all inert successors of t are guaranteed to be in $U$, so, provided t
does not have a transition in $T_{B\rightarrow}$, one can also add t to $U$. To take care of the
possibility that all inert transitions of t have been visited before all sources of
unmarked transitions in $T_{B\rightarrow}$ are added to $R$, we check all non-inert transitions
of t to determine whether they are not in $T_{B\rightarrow}$ at Lines 3.13ℓ-3.17ℓ.
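
The counter technique can be sketched in isolation as follows. This Python fragment is a simplified, single-threaded illustration and not the left coroutine of Algorithm 3: it assumes that the set of states with a transition in the splitter is known in advance (so the slow test is not needed), that the block contains no inert cycles, and that each inert transition is listed once. The name `compute_U` and the input encoding are hypothetical.

```python
from collections import defaultdict

def compute_U(block, inert, splitter_sources):
    """States of `block` that cannot inertly reach a source of the splitter.

    block            -- set of states B (assumed free of inert cycles)
    inert            -- (s, t) pairs: inert transitions inside B
    splitter_sources -- states of B that have a transition in the splitter
    """
    preds, out_inert = defaultdict(list), defaultdict(int)
    for s, t in inert:
        preds[t].append(s)
        out_inert[s] += 1
    # Bottom states (no outgoing inert transition) without a splitter transition.
    U = {s for s in block if out_inert[s] == 0 and s not in splitter_sources}
    untested = {}                       # a missing key plays the role of "undefined"
    todo = list(U)
    while todo:
        s = todo.pop()
        for t in preds[s]:
            if t not in untested:
                untested[t] = out_inert[t]
            untested[t] -= 1
            # t joins U only once every inert successor is known to be in U.
            if untested[t] == 0 and t not in splitter_sources:
                U.add(t)
                todo.append(t)
    return U
```

Every inert transition is inspected at most once, namely when its target is added to U, which is why the work can be attributed to the incoming transitions of U.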

The coroutine that finishes first, provided that its number of states does not
exceed $\frac{1}{2}|B|$, has completely computed the smaller subblock resulting from the
refinement, and the other coroutine can be aborted. As soon as the number
of states of a coroutine is known to exceed $\frac{1}{2}|B|$, it is aborted, and the other
coroutine can continue to identify the smaller subblock. In detail, the runtime
complexity of $(R, U)$ := split($B$, $T$) is:

- $O(|R_\rightarrow| + |R_\leftarrow|)$, if $|R| \le |U|$, and

- $O(|Marked(T)| + |U_\rightarrow| + |U_\leftarrow| + |(Bottom(R) \setminus Bottom(B))_\rightarrow|)$, if $|U| \le |R|$.

This complexity is inferred as follows. As we execute the coroutines in lockstep,
it suffices to show that the runtime bound for the smaller subblock is satisfied.

In case $|R| \le |U|$, observe $|Marked(T)| \le |R_\rightarrow|$, so we get $O(|R_\rightarrow| + |R_\leftarrow|)$
directly from the $R$-coroutine. When $|U| \le |R|$, we use time in $O(|Marked(T)|)$
for Line 3.2, and we use time in $O(|U_\leftarrow|)$ for everything else except Lines 3.13ℓ-
3.17ℓ. For these latter lines, we distinguish two cases. If it turns out that t has
no transition $t \xrightarrow{\alpha} u \in T$, it is a $U$-state, so we attribute the time to $O(|U_\rightarrow|)$.
Otherwise, it is an $R$-state that had some inert transitions in $B$, but they all are
now in $R \xrightarrow{\tau} U$. So t is a new bottom state, and we attribute the time to the
outgoing transitions of new bottom states: $O(|(Bottom(R) \setminus Bottom(B))_\rightarrow|)$.


5 Experimental evaluation 


The new algorithm (JGKW20) has been implemented in the mCRL2 toolset [5] 
and is available in its 201908.0 release. This toolset also contains implementations 
of various other algorithms, such as the O(mn) algorithm by Groote and Vaan- 
drager (GV) [10] and the O(m(log | Act| + log n)) algorithm of [9] (GJKW17). 
In addition, it offers a sequential implementation of the partition-refinement
algorithm using state signatures by Blom and Orzan (BO) [3], which has time
complexity $O(n^2 m)$. For each state, BO maintains a signature describing which
blocks the state can reach directly via its outgoing transitions.

In this section, we report on the experiments we have conducted to compare
GV, BO, GJKW17 and JGKW20 when applied to practical examples. In the
experiments the given LTSs are minimised w.r.t. branching bisimilarity. The set
of benchmarks consists of all LTSs offered by the VLTS benchmark set³ with
at least 60,000 transitions. Their name ends in "_⌊n/1000⌋_⌊m/1000⌋" and thus
describes their size. Additionally, we consider three cases that have been derived
from models distributed with the mCRL2 toolset:

1. lift6-final: this model is based on an elevator model, extended to six elevators
   (n = 6,047,527, m = 26,539,368);

2. dining_14: this is the dining philosophers model with 14 philosophers
   (n = 18,378,370, m = 164,329,284);

3. 1394-fin3: this is an altered version of the 1394-fin model, extended to three
   processes and two data elements (n = 126,713,623, m = 276,426,688).

³ http://cadp.inria.fr/resources/vlts.

The software and benchmarks used for the experiments are available online [15].
All experiments have been conducted on individual nodes of the DAS-5 cluster [1].
Each of these nodes was running CentOS Linux 7.4, had an Intel
Xeon E5-2698-v3 2.3GHz CPU, and was equipped with 256 GB RAM.
Development version 201808.0.c59cfd413f of mCRL2 was used for the experiments.

Table 1 presents the obtained results. Benchmarks are ordered by their number
of transitions. On each benchmark, we have applied each algorithm ten times,
and report the mean runtime and memory use of these ten runs, rounded to
significant digits (estimated using [4] for the standard deviation). A trailing decimal
dot indicates that the unit digit is significant. If this dot is missing, there is one
insignificant zero. For all presented data the estimated standard deviation is less
than 20% of the mean. Otherwise we print "-" in Table 1.

The ▽-symbol after a table entry indicates that the measurement is
significantly better than the corresponding measurements for the other three
algorithms, and the △-symbol indicates that it is significantly worse. Here, the results
are considered significant if, given a hundred tables such as Table 1, one table of
running time (resp. memory) is expected to contain spuriously significant results.

Concerning the runtimes, clearly, GV and BO perform significantly worse
than the other two algorithms, and JGKW20 in many cases performs
significantly better than the others. In particular, JGKW20 is about 40% faster than
GJKW17, the fastest older algorithm. Concerning memory use, in the majority
of cases GJKW17 uses more memory than the others, while sometimes BO is
the most memory-hungry. JGKW20 is much more competitive, in many cases
even outperforming every other algorithm.

The results show that when applied to practical cases, JGKW20 is generally
the fastest algorithm, and even when other algorithms have similar runtimes, it
almost always uses the least memory. This combination makes JGKW20
currently the best option for branching bisimulation minimisation of LTSs.
Abstract. One important application of quantum process algebras is
to formally verify quantum communication protocols. With a suitable 
notion of behavioural equivalence and a decision method, one can de- 
termine if an implementation of a protocol is consistent with its specifi- 
cation. Ground bisimulation is a convenient behavioural equivalence for 
quantum processes because of its associated coinduction proof technique. 
We exploit this technique to design and implement two on-the-fly algo- 
rithms for the strong and weak versions of ground bisimulation to check 
if two given processes in quantum CCS are equivalent. We then develop 
a tool that can verify interesting quantum protocols such as the BB84 
quantum key distribution scheme. 


Keywords: Quantum process algebra - Bisimulation - Verification - 
Quantum communication protocols. 


1 Introduction 


Process algebras provide a useful formal method for specifying and verifying 
concurrent systems. Their extensions to the quantum setting have also appeared 
in the literature. For example, Jorrand and Lalire [18,21] defined the Quantum 
Process Algebra (QPAlg) and presented a branching bisimulation to identify 
quantum processes with the same branching structure. Gay and Nagarajan [15] 
developed Communicating Quantum Processes (CQP), for which Davidson [6] 
established a bisimulation congruence. Feng et al. [10] have proposed a quantum
variant of Milner's CCS [23], called qCCS, and a notion of probabilistic
bisimulation for quantum processes, which was later improved to a general
notion of bisimulation that enjoys a congruence property [12]. Later on, motivated
by [25], Deng and Feng [9] defined an open bisimulation for quantum processes 
that makes it possible to separate ground bisimulation and the closedness
under super-operator applications, thus providing not only a neater and simpler
definition, but also a new technique for proving bisimilarity. In order to avoid 
the problem of instantiating quantum variables by potentially infinitely many 
quantum states, Feng et al. [11] extended the idea of symbolic bisimulation [17] 
for value-passing CCS and provided a symbolic version of open bisimulation for 
qCCS. They proposed an algorithm for checking symbolic ground bisimulation. 

In the current work, we consider the ground bisimulation proposed in [9]. We 
put forward an on-the-fly algorithm to check if two given processes in qCCS with 
fixed initial quantum states are ground bisimilar. The algorithm is simpler than 
the one in [11] because the initial quantum states are determined for the former 
but can be parametric for the latter. Moreover, in many applications, we are only 
interested in the correctness of a quantum protocol with a predetermined input 
of quantum states. This is especially the case in the design stage of a protocol 
or in the debugging of a program. 

The ground bisimulation defined in [9] is a notion of weak bisimulation be- 
cause a strong transition can be matched by a weak transition where invisible 
actions are abstracted away. We also consider a strong version where all ac- 
tions are visible, for which we have a simpler algorithm. Both algorithms are 
obtained by adapting the on-the-fly algorithm for checking probabilistic bisimu- 
lations [8,7], which in turn has its root in similar algorithms for checking classical 
bisimulations [14,17]. The basic idea is as follows. A quantum process with an 
initial quantum state forms a configuration. We describe the operational be- 
haviour of a configuration as a probabilistic labelled transition system (pLTS), 
where probabilistic transitions arise naturally because measuring a quantum sys- 
tem can entail a probability distribution of post-measurement quantum systems. 
Ground bisimulations are a strengthening of probabilistic bisimulations by im- 
posing some constraints on quantum variables and the environment states of 
processes. The skeleton of the algorithm for the strong ground bisimulation
resembles that for strong probabilistic bisimulation [8]. The algorithm for the
(weak) ground bisimulation is inspired by [28] and uses as a subroutine a proce- 
dure in the aforementioned work. The procedure reduces the problem of finding 
a matching weak transition to a linear programming problem that can be solved 
in polynomial time. We have developed a tool that implements both algorithms 
and can check if two given configurations are strongly or weakly bisimilar. It 
is useful to validate whether an implementation of a protocol is equivalent to 
the specification. We have conducted experiments on a few interesting quantum 
protocols including super-dense coding, teleportation, secret sharing, and sev- 
eral quantum key distribution protocols, in particular the BB84 protocol [5], to 
analyse the functional correctness of the protocols. 


Other related work Ardeshir-Larijani et al. [3] proposed a quantum variant of 
CCS to describe quantum protocols. The syntax of that variant is similar to 
qCCS but its semantics is very different. The behaviour of a concurrent pro- 
cess is a finite tree and an interleaving is a path from the root to a leaf. By 
interpreting an interleaving as a superoperator [26], the semantics of a process 
is a set of superoperators. The equivalence checking between two processes boils
down to the equivalence checking between superoperators, which is accomplished 
by using the stabiliser simulation algorithm invented by Aaronson and Gottes- 
man [1]. Ardeshir-Larijani et al. have implemented their approach in an equiva- 
lence checker in Java and verified several quantum protocols from teleportation 
to secret sharing. However, they are not able to handle the BB84 quantum key 
distribution protocol because its correctness cannot be specified as an equiva- 
lence between interleavings. Our approach is based on ground bisimulation and 
keeps all the branching behaviour of a concurrent process. Our algorithms for 
checking ground bisimulations are influenced by the on-the-fly algorithm of Hen- 
nessy and Lin for value-passing CCS [17]. We are inspired by the probabilistic 
bisimulation checking algorithm of Baier et al. [4] for the strong version of ground 
bisimulation, and by the weak bisimulation checking algorithm of Turrini and 
Hermanns [28] for the weak version. 

Kubota et al. [20] implemented a semi-automated tool to check a notion of 
symbolic bisimulation and used it to verify the equivalence of BB84 and another 
quantum key distribution protocol based on entanglement distillation [27]. There 
are two main differences between their work and ours. (1) Their tool is based on 
equational reasoning and thus requires a user to provide equations while our tool 
is fully automatic. (2) Their semantic interpretation of measurement is different 
and entails a kind of linear-time semantics for quantum processes that ignores 
the timepoints of the occurrences of probabilistic branches. However, we use a 
branching-time semantics. For instance, the occurrence of a measurement before 
or after a visible action is significant for our semantics but not for the semantics 
proposed in [20]. 

Besides equivalence checking, based on either superoperators or bisimulations 
as mentioned above, model checking is another feasible approach to verify quan- 
tum protocols. For instance, Gay et al. developed the QMC model checker [16]. 
Feng et al. implemented the tool QPMC [13] to model check quantum programs 
and protocols. There are also other approaches for verifying quantum systems. 
Abramsky and Coecke [2] proposed a categorical semantics for quantum pro- 
tocols. Quantomatic [19] is a semi-automated tool based on graph rewriting. 
Ying [30] established a quantum Hoare logic, which has been implemented in a 
theorem prover [22]. 

The rest of the paper is structured as follows. In Section 2 we recall the
syntax and semantics of the quantum process algebra qCCS. In Section 3 we 
present an algorithm for checking ground bisimulations. In Section 4 we report 
the implementation of the algorithm and some experimental results on verifying 
a few quantum communication protocols. Finally, we conclude in Section 5 and 
discuss some future work. 


2 Quantum CCS 


We introduce a quantum extension of classical CCS (qCCS) which was originally
studied in [10,29,12]. Three types of data are considered in qCCS: as classical
data we have Bool for booleans and Real for real numbers, and as quantum data
we have Qbt for qubits. Consequently, two countably infinite sets of variables
are assumed: cVar for classical variables, ranged over by $x, y, \ldots$, and qVar for
quantum variables, ranged over by $q, r, \ldots$. We assume a set Exp, which includes
cVar as a subset and is ranged over by $e, e', \ldots$, of classical data expressions over
Real, and a set of boolean-valued expressions BExp, ranged over by $b, b', \ldots$,
with the usual boolean constants true, false, and operators $\neg$, $\wedge$, $\vee$, and $\rightarrow$.
In particular, we let $e \bowtie e'$ be a boolean expression for any $e, e' \in Exp$ and
$\bowtie \in \{>, <, \ge, \le, =\}$. We further assume that only classical variables can occur
freely in both data expressions and boolean expressions. Two types of channels
are used: cChan for classical channels, ranged over by $c, d, \ldots$, and qChan for
quantum channels, ranged over by $c, d, \ldots$. A relabelling function $f$ is a map
on cChan $\cup$ qChan such that $f(cChan) \subseteq cChan$ and $f(qChan) \subseteq qChan$.
Sometimes we abbreviate a sequence of distinct variables $q_1, \ldots, q_n$ into $\tilde{q}$.

$qv(nil) = \emptyset$                                  $qv(\tau.P) = qv(P)$
$qv(c?x.P) = qv(P)$                              $qv(c!e.P) = qv(P)$
$qv(c?q.P) = qv(P) - \{q\}$                      $qv(c!q.P) = qv(P) \cup \{q\}$
$qv(\mathcal{E}[\tilde{q}].P) = qv(P) \cup \tilde{q}$          $qv(M[\tilde{q}; x].P) = qv(P) \cup \tilde{q}$
$qv(P + Q) = qv(P) \cup qv(Q)$                   $qv(P \parallel Q) = qv(P) \cup qv(Q)$
$qv(P[f]) = qv(P)$                               $qv(P \backslash L) = qv(P)$
$qv(\textbf{if } b \textbf{ then } P) = qv(P)$   $qv(A(\tilde{q}; \tilde{x})) = \tilde{q}$

Fig. 1. Free quantum variables

The terms in qCCS are given by:


$P, Q ::= nil \mid \tau.P \mid c?x.P \mid c!e.P \mid c?q.P \mid c!q.P \mid \mathcal{E}[\tilde{q}].P \mid M[\tilde{q}; x].P \mid$
$\qquad\quad P + Q \mid P \parallel Q \mid P[f] \mid P \backslash L \mid \textbf{if } b \textbf{ then } P \mid A(\tilde{q}; \tilde{x})$

where $f$ is a relabelling function and $L \subseteq cChan \cup qChan$ is a set of channels.
Most of the constructors are standard as in CCS [23]. We briefly explain a few
new constructors. The process $c?q.P$ receives a quantum datum along quantum
channel $c$ and evolves into $P$, while $c!q.P$ sends out a quantum datum along
quantum channel $c$ before evolving into $P$. The symbol $\mathcal{E}$ represents a
trace-preserving super-operator applied on the quantum system referred to by the
variables $\tilde{q}$. The process $M[\tilde{q}; x].P$ measures the state of qubits $\tilde{q}$ according to
the observable $M$ and stores the measurement outcome into the classical variable
$x$ of $P$.

Free classical variables can be defined in the usual way, except for the fact 
that the variable x in the quantum measurement M [d; x] is bound. A process P 
is closed if it contains no free classical variable, i.e. fu(P) = 0. 

'The set of free quantum variables for process P, denoted by qv(P) can be 
inductively defined as in Figure 1. For a process to be legal, we require that 


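1. $q \notin qv(P)$ in the process $c!q.P$;
2. $qv(P) \cap qv(Q) = \emptyset$ in the process $P \parallel Q$;
3. Each constant $A(\tilde{q};\tilde{x})$ has a defining equation $A(\tilde{q};\tilde{x}) := P$, where $P$ is a
term with $qv(P) \subseteq \tilde{q}$ and $fv(P) \subseteq \tilde{x}$.


(Tau) \quad $(\tau.P, \rho) \xrightarrow{\tau} (P, \rho)$

(C-Inp) \quad $\dfrac{v \in Real}{(c?x.P, \rho) \xrightarrow{c?v} (P[v/x], \rho)}$

(C-Outp) \quad $\dfrac{v = [\![e]\!]}{(c!e.P, \rho) \xrightarrow{c!v} (P, \rho)}$

(C-Com) \quad $\dfrac{(P_1, \rho) \xrightarrow{c?v} (P_1', \rho) \qquad (P_2, \rho) \xrightarrow{c!v} (P_2', \rho)}{(P_1 \parallel P_2, \rho) \xrightarrow{\tau} (P_1' \parallel P_2', \rho)}$

(Q-Inp) \quad $\dfrac{r \notin qv(c?q.P)}{(c?q.P, \rho) \xrightarrow{c?r} (P[r/q], \rho)}$

(Q-Outp) \quad $(c!q.P, \rho) \xrightarrow{c!q} (P, \rho)$

(Q-Com) \quad $\dfrac{(P_1, \rho) \xrightarrow{c?r} (P_1', \rho) \qquad (P_2, \rho) \xrightarrow{c!r} (P_2', \rho)}{(P_1 \parallel P_2, \rho) \xrightarrow{\tau} (P_1' \parallel P_2', \rho)}$

(Oper) \quad $(\mathcal{E}[\tilde{q}].P, \rho) \xrightarrow{\tau} (P, \mathcal{E}_{\tilde{q}}(\rho))$

(Meas) \quad $\dfrac{M = \sum_{i \in I} \lambda_i E^i \qquad p_i = tr(E^i_{\tilde{q}} \rho)}{(M[\tilde{q};x].P, \rho) \xrightarrow{\tau} \sum_{i \in I} p_i \cdot (P[\lambda_i/x],\ E^i_{\tilde{q}} \rho E^i_{\tilde{q}} / p_i)}$

(Int) \quad $\dfrac{(P_1, \rho) \xrightarrow{\alpha} \Delta \qquad qbv(\alpha) \cap qv(P_2) = \emptyset}{(P_1 \parallel P_2, \rho) \xrightarrow{\alpha} \Delta \parallel P_2}$

(Sum) \quad $\dfrac{(P_1, \rho) \xrightarrow{\alpha} \Delta}{(P_1 + P_2, \rho) \xrightarrow{\alpha} \Delta}$

(Rel) \quad $\dfrac{(P, \rho) \xrightarrow{\alpha} \Delta}{(P[f], \rho) \xrightarrow{f(\alpha)} \Delta[f]}$

(Res) \quad $\dfrac{(P, \rho) \xrightarrow{\alpha} \Delta \qquad cn(\alpha) \cap L = \emptyset}{(P \backslash L, \rho) \xrightarrow{\alpha} \Delta \backslash L}$

(Cho) \quad $\dfrac{(P, \rho) \xrightarrow{\alpha} \Delta \qquad [\![b]\!] = true}{(\mathbf{if}\ b\ \mathbf{then}\ P, \rho) \xrightarrow{\alpha} \Delta}$

(Cons) \quad $\dfrac{(P[\tilde{v}/\tilde{x}, \tilde{r}/\tilde{q}], \rho) \xrightarrow{\alpha} \Delta \qquad A(\tilde{x}, \tilde{q}) := P}{(A(\tilde{v}, \tilde{r}), \rho) \xrightarrow{\alpha} \Delta}$


Fig. 2. Operational semantics of qCCS. Here in rule (C-Outp), $[\![e]\!]$ is the evaluation of
$e$, and in rule (Meas), $E^i_{\tilde{q}}$ denotes the operator $E^i$ acting on the quantum systems $\tilde{q}$.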


The first condition says that a quantum system will not be referenced after it 
has been sent out. This is a requirement of the quantum no-cloning theorem. 
The second condition says that parallel composition || models separate parties 
that never reference a quantum system simultaneously. 
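
The free-variable function and the first two legality conditions can be sketched in a few lines of Python. This is only an illustration over a small fragment of the syntax: the class names (`Nil`, `QIn`, `QOut`, `Op`, `Par`) are hypothetical, and the clause for quantum input, $qv(c?q.P) = qv(P) \setminus \{q\}$, is assumed from the usual presentation of qCCS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nil:                 # nil
    pass

@dataclass(frozen=True)
class QIn:                 # c?q.P  (q is bound in P)
    chan: str
    var: str
    cont: object

@dataclass(frozen=True)
class QOut:                # c!q.P
    chan: str
    var: str
    cont: object

@dataclass(frozen=True)
class Op:                  # E[q~].P or M[q~;x].P: both add q~ to qv
    qubits: tuple
    cont: object

@dataclass(frozen=True)
class Par:                 # P || Q
    left: object
    right: object

def qv(p):
    """Free quantum variables of a process term."""
    if isinstance(p, Nil):
        return frozenset()
    if isinstance(p, QIn):
        return qv(p.cont) - {p.var}          # assumed clause for quantum input
    if isinstance(p, QOut):
        return qv(p.cont) | {p.var}
    if isinstance(p, Op):
        return qv(p.cont) | frozenset(p.qubits)
    if isinstance(p, Par):
        return qv(p.left) | qv(p.right)
    raise TypeError(f"unknown term: {p!r}")

def legal(p):
    """Check legality conditions 1 and 2 recursively."""
    if isinstance(p, Nil):
        return True
    if isinstance(p, QOut):                  # condition 1: q not used after c!q
        return p.var not in qv(p.cont) and legal(p.cont)
    if isinstance(p, Par):                   # condition 2: disjoint qubits
        return (qv(p.left).isdisjoint(qv(p.right))
                and legal(p.left) and legal(p.right))
    return legal(p.cont)

alice = Op(("q1",), QOut("A2B", "q1", Nil()))      # E[q1].A2B!q1.nil
bob = QIn("A2B", "q1", Op(("q1",), Nil()))         # A2B?q1.E[q1].nil
bad = QOut("c", "q", Op(("q",), Nil()))            # c!q.E[q].nil uses q after sending
```

Here `legal(Par(alice, bob))` holds because Bob's `q1` is bound by the input, whereas `bad` violates the first condition.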


Throughout the paper we implicitly assume the convention that processes
are identified up to $\alpha$-conversion, bound variables differ from each other and
they are different from free variables.


Before introducing the operational semantics of qCCS processes, we review
the model of probabilistic labelled transition systems (pLTSs). Later on we will
interpret the behaviour of quantum processes in terms of pLTSs because quantum
measurements give rise to probability distributions naturally.

We begin with some notation. A (discrete) probability distribution over a
set $S$ is a function $\Delta : S \to [0,1]$ with $\sum_{s \in S} \Delta(s) = 1$; the support of such a $\Delta$ is
the set $\lceil\Delta\rceil = \{s \in S \mid \Delta(s) > 0\}$. The point distribution $\overline{s}$ assigns probability
1 to $s$ and 0 to all other elements of $S$, so that $\lceil\overline{s}\rceil = \{s\}$. We only need to
use distributions with finite supports, and let $Dist(S)$ denote the set of finite
support distributions over $S$, ranged over by $\Delta$, $\Theta$, etc. If $\sum_{k \in K} p_k = 1$ for some
collection of $p_k \geq 0$, and the $\Delta_k$ are distributions, then so is $\sum_{k \in K} p_k \cdot \Delta_k$ with


$(\sum_{k \in K} p_k \cdot \Delta_k)(s) = \sum_{k \in K} p_k \cdot \Delta_k(s)$.
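
As a small illustration of these notions, finite-support distributions can be represented as dictionaries from states to probabilities. The helper names `point`, `support` and `combine` below are hypothetical, and exact rational arithmetic is used only to keep the example free of rounding issues.

```python
from fractions import Fraction

def point(s):
    """Point distribution: probability 1 on s."""
    return {s: Fraction(1)}

def support(dist):
    """States with strictly positive probability."""
    return {s for s, p in dist.items() if p > 0}

def combine(weighted):
    """Convex combination sum_k p_k * Delta_k of (p_k, Delta_k) pairs."""
    assert all(p >= 0 for p, _ in weighted)
    assert sum(p for p, _ in weighted) == 1
    out = {}
    for p, dist in weighted:
        for s, ps in dist.items():
            out[s] = out.get(s, Fraction(0)) + p * ps
    return out

half = Fraction(1, 2)
mix = combine([(half, point("s")), (half, {"s": half, "t": half})])
```

The combination is again a distribution: `mix` assigns 3/4 to `s` and 1/4 to `t`.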


Definition 1. A probabilistic labelled transition system is a triple $(S, Act_\tau, \rightarrow)$,
where $S$ is a set of states, $Act_\tau$ is a set of visible actions $Act$ augmented with the
invisible action $\tau$, and $\rightarrow\ \subseteq S \times Act_\tau \times Dist(S)$ is the transition relation.


We often write $s \xrightarrow{\alpha} \Delta$ for $(s, \alpha, \Delta) \in\ \rightarrow$. In pLTSs we not only consider
relations between states, but also relations between distributions. Therefore, we
make use of the lifting operation below [7].


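Definition 2. Let $\mathcal{R} \subseteq S \times S$ be a relation between states. Then $\mathcal{R}^\circ \subseteq Dist(S) \times
Dist(S)$ is the smallest relation that satisfies the two rules: (i) $s\ \mathcal{R}\ s'$ implies
$\overline{s}\ \mathcal{R}^\circ\ \overline{s'}$; (ii) $\Delta_i\ \mathcal{R}^\circ\ \Theta_i$ for all $i \in I$ implies $(\sum_{i \in I} p_i \cdot \Delta_i)\ \mathcal{R}^\circ\ (\sum_{i \in I} p_i \cdot \Theta_i)$ for
any $p_i \in [0,1]$ with $\sum_{i \in I} p_i = 1$, where $I$ is a finite index set.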


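We apply this operation to the relations $\xrightarrow{\alpha}$ in the pLTS for $\alpha \in Act_\tau$, where
we also write $\xrightarrow{\alpha}$ for $(\xrightarrow{\alpha})^\circ$. Thus as source of a relation $\xrightarrow{\alpha}$ we now also allow
distributions. But note that $\overline{s} \xrightarrow{\alpha} \Delta$ is more general than $s \xrightarrow{\alpha} \Delta$ because if
$\overline{s} \xrightarrow{\alpha} \Delta$ then there is a collection of distributions $\Delta_i$ and probabilities $p_i$ such
that $s \xrightarrow{\alpha} \Delta_i$ for each $i \in I$ and $\Delta = \sum_{i \in I} p_i \cdot \Delta_i$ with $\sum_{i \in I} p_i = 1$.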


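We write $s \xrightarrow{\hat{\tau}} \Delta$ if either $s \xrightarrow{\tau} \Delta$ or $\Delta = \overline{s}$. We define weak transitions $\stackrel{\hat{\alpha}}{\Longrightarrow}$
by letting $\stackrel{\hat{\tau}}{\Longrightarrow}$ be the reflexive and transitive closure of $\xrightarrow{\hat{\tau}}$ and writing $\Delta \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$
for $\alpha \in Act$ whenever $\Delta \stackrel{\hat{\tau}}{\Longrightarrow} \xrightarrow{\alpha} \stackrel{\hat{\tau}}{\Longrightarrow} \Theta$. If $\Delta = \overline{s}$ is a point distribution, we often
write $s \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ instead of $\overline{s} \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$.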

We now give the semantics of qCCS. For each quantum variable $q$ we assume
a 2-dimensional Hilbert space $\mathcal{H}_q$. For any nonempty subset $S \subseteq qVar$ we write
$\mathcal{H}_S$ for the tensor product space $\bigotimes_{q \in S} \mathcal{H}_q$ and $\mathcal{H}_{\overline{S}}$ for $\bigotimes_{q \notin S} \mathcal{H}_q$. In particular,
$\mathcal{H} = \mathcal{H}_{qVar}$ is the state space of the whole environment consisting of all the
quantum variables, which is a countably infinite dimensional Hilbert space.

Let $P$ be a closed quantum process and $\rho$ a density operator on $\mathcal{H}$¹; the pair
$(P, \rho)$ is called a configuration. We write Con for the set of all configurations,
ranged over by $C$ and $D$. We interpret qCCS with a pLTS whose states are all the
configurations definable in the language, and whose transitions are determined
by the rules in Figure 2; we have omitted the obvious symmetric counterparts
to the rules (C-Com), (Q-Com), (Int) and (Sum). The set of actions Act takes
the following form, consisting of classical/quantum input/output actions.


$Act = \{c?v,\ c!v \mid c \in cChan,\ v \in Real\} \cup \{c?r,\ c!r \mid c \in qChan,\ r \in qVar\}$


¹ As $\mathcal{H}$ is infinite dimensional, $\rho$ should be understood as a density operator on some
finite dimensional subspace of $\mathcal{H}$ which contains $\mathcal{H}_{qv(P)}$.

We use $cn(\alpha)$ for the set of channel names in action $\alpha$. For example, we have
$cn(c?x) = \{c\}$ and $cn(\tau) = \emptyset$.

In the first eight rules in Figure 2, the targets of arrows are point distributions,
and we use the slightly abbreviated form $C \xrightarrow{\alpha} C'$ to mean $C \xrightarrow{\alpha} \overline{C'}$.
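
By contrast, rule (Meas) yields a genuine distribution over configurations. The following sketch shows only the quantum part of that rule for a single qubit, using plain nested lists as matrices; the helper names are hypothetical and the outcome index `i` stands in for the value $\lambda_i$.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def measure(rho, projectors):
    """Return the branches (p_i, i, E^i rho E^i / p_i) with p_i > 0."""
    branches = []
    for i, proj in enumerate(projectors):
        unnormalised = matmul(matmul(proj, rho), proj)
        p = trace(unnormalised)           # equals tr(E^i rho) for a projector
        if p > 0:
            post = [[x / p for x in row] for row in unnormalised]
            branches.append((p, i, post))
    return branches

plus = [[0.5, 0.5], [0.5, 0.5]]           # density operator |+><+|
P0 = [[1, 0], [0, 0]]                     # projector onto |0>
P1 = [[0, 0], [0, 1]]                     # projector onto |1>
branches = measure(plus, [P0, P1])
```

Measuring $|+\rangle$ in the computational basis gives two branches of probability 1/2 each, with post-measurement states $|0\rangle\langle 0|$ and $|1\rangle\langle 1|$.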

The rules use the obvious extension of the function $\parallel$ on terms to configurations
and distributions. To be precise, $C \parallel P$ is the configuration $(Q \parallel P, \rho)$
where $C = (Q, \rho)$, and $\Delta \parallel P$ is the distribution defined by:


$(\Delta \parallel P)((Q, \rho)) = \Delta((Q', \rho))$ if $Q = Q' \parallel P$ for some $Q'$, and $0$ otherwise.


Similar extension applies to $\Delta[f]$ and $\Delta \backslash L$.

Suppose there is a configuration $C = (P, \rho)$. The partial trace over system
$P$ at such a state can be defined as $tr_{qv(P)}(\rho)$, whose result is a reduced density
operator representing the state of the environment. We give the definition of
ground bisimulation and bisimilarity as follows.


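Definition 3 ([9]). A relation $\mathcal{R} \subseteq Con \times Con$ is a ground simulation if for
any $C = (P, \rho)$, $D = (Q, \sigma)$, $C\ \mathcal{R}\ D$ implies that $qv(P) = qv(Q)$, $tr_{qv(P)}(\rho) =
tr_{qv(Q)}(\sigma)$, and


– whenever $C \xrightarrow{\alpha} \Delta$, there is some distribution $\Theta$ with $D \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ and $\Delta\ \mathcal{R}^\circ\ \Theta$.


A relation $\mathcal{R}$ is a ground bisimulation if both $\mathcal{R}$ and $\mathcal{R}^{-1}$ are ground simulations.
We denote by $\approx$ the largest ground bisimulation, called ground bisimilarity. If
the above weak transition $D \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ is replaced by a strong transition $D \xrightarrow{\alpha} \Theta$,
we obtain a strong ground bisimulation.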


In the rest of the paper, we mainly focus on ground bisimulation and only 
briefly mention the algorithm for checking strong ground bisimulation. 
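
To make the environment-state condition of Definition 3 concrete, here is a minimal sketch of a partial trace for a two-qubit register, assuming the process owns the first qubit and the environment the second. The function name and the basis ordering (index $2 q_1 + q_2$) are choices made for this illustration only.

```python
def trace_out_first(rho):
    """Partial trace over the first qubit of a 4x4 two-qubit density matrix.

    Basis states |q1 q2> are indexed by 2*q1 + q2, so entry (i, j) of the
    reduced operator sums rho over both values a of the first qubit.
    """
    return [[sum(rho[2 * a + i][2 * a + j] for a in range(2))
             for j in range(2)]
            for i in range(2)]

# Bell state (|00> + |11>)/sqrt(2): the environment qubit is maximally mixed.
bell = [[0.5, 0, 0, 0.5],
        [0,   0, 0, 0],
        [0,   0, 0, 0],
        [0.5, 0, 0, 0.5]]

# Product state |0>|1>: the environment qubit is exactly |1><1|.
zero_one = [[0, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

env_bell = trace_out_first(bell)
env_product = trace_out_first(zero_one)
```

Two configurations whose processes hold the first qubit can only be related by a ground (bi)simulation if these reduced operators coincide; here they differ, since `env_bell` is the maximally mixed state while `env_product` is pure.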


3 Algorithm 


We present an on-the-fly algorithm to check if two configurations are ground 
bisimilar. 

The algorithm maintains two sets NonBisim and Bisim to keep non-bisimilar 
and bisimilar state pairs, respectively. When the algorithm terminates, Bisim 
should contain all the state pairs satisfying the bisimulation relation. 

The function Bisim(t, u), as shown in Algorithm 1, is the main function of 
the algorithm, which attempts to find the smallest bisimulation containing the 
pair (t, u). It initialises Bisim and a set named Visited to store the visited 
pairs, then calls the function Match to search for a bisimulation. The function 
Match(t, u, Visited) invokes a depth-first traversal to match a pair of states 
(t,u) with all their possible behaviours. The set Visited is updated before the 
traversal for detecting loops. We also match the behaviours of t and u from both 
directions as we are checking bisimulations. Two states are deemed non-bisimilar 
in three cases:

– one state has a transition that cannot be matched by any possible weak
transition from the other;

– they do not have the same set of free quantum variables;

– the density operators corresponding to their quantum registers are
different.


The first case is checked by MatchAction, and the other two are done in 
Match. We add a pair of states to NonBisim if one of the three cases above 
has occurred. Otherwise, it will be stored in Bisim. 

An auxiliary function Act(t) is invoked in Match to discover the next actions
that t can perform. If t has no more actions to perform, the function returns
the empty set.

The function MatchAction($\alpha$, t, u, Visited) checks the equivalence of
configurations by comparing their transitions. The function recursively discovers
the next equivalent state pairs between the target states of the transitions.
Technically, it checks the condition that if $t \xrightarrow{\alpha} \Delta$ then there exists some $\Theta$
such that $u \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ and $\Delta\ \mathcal{R}^\circ\ \Theta$. Here we use as a subroutine a procedure of
[28] to reduce the problem to a linear programming problem that can be solved
in polynomial time. The problem is defined in the Appendix. In MatchAction, we
introduce a predicate $LP(\Delta, u, \alpha, \mathcal{R})$ which is true if and only if the linear
programming problem has a solution. We invoke the function Close to construct
an equivalence relation $\mathcal{R}$ between S and the states in the support of the target
distribution. Note that in Lines 28 and 34 we have two distinct cases because
in output actions the emitted values are required to be equal, unlike in
other types of actions.

In general, there are loops in pLTSs. When a state pair to be considered
is already contained in Visited, it will be assumed to be bisimilar and added
to Assumed (Lines 42-43). Later on, if the pair of states is found to be
non-bisimilar, the pair will be added to NonBisim and a wrong assumption exception
(Lines 18-21) will be raised to restart the checking process from the original pair
of states. Then Bisim(t, u) renews the sets Bisim, Visited and Assumed to
remove the pairs checked under the wrong assumption (Lines 4-6).


Algorithm 1 Checking ground bisimulation


Require: Two pLTSs with initial configurations t and u.
Ensure: A boolean value b_res indicating if the two pLTSs are ground bisimilar.


 1: function GroundBisimulation(t, u) =
 2:     NonBisim := ∅
 3:     function Bisim(t, u) = try {
 4:         Bisim := ∅
 5:         Visited := ∅
 6:         Assumed := ∅
 7:         return Match(t, u, Visited)
 8:     } catch WrongAssumptionException ⇒ Bisim(t, u)
 9:
10: function Match(t, u, Visited) =            ▷ t = (P, ρ) and u = (Q, σ)
11:     Visited := Visited ∪ {(t, u)}
12:     b1 := ⋀_{α ∈ Act(t)} MatchAction(α, t, u, Visited)
13:     b2 := ⋀_{α ∈ Act(u)} MatchAction(α, u, t, Visited)
14:     b_c1 := qv(P) = qv(Q)
15:     b_c2 := tr_{qv(P)}(ρ) = tr_{qv(Q)}(σ)
16:     b_res := b1 ∧ b2 ∧ b_c1 ∧ b_c2
17:     if b_res is tt then Bisim := Bisim ∪ {(t, u)}
18:     else if b_res is ff then
19:         NonBisim := NonBisim ∪ {(t, u)}
20:         if (t, u) ∈ Assumed then
21:             raise WrongAssumptionException
22:     return b_res
23:
24: function MatchAction(α, t, u, Visited) =
25:     switch α do
26:         case c!e
27:             for t -α→ Δ_i do
28:                 Assume {t_k} with t_k ∈ ⌈Δ_i⌉ and {u_j} with u =c!e'⇒ Γ ∧ e = e' ∧ u_j ∈ ⌈Γ⌉
29:                 R := {(t_k, u_j) | Close(t_k, u_j, Visited) = tt}
30:                 θ_i := LP(Δ_i, u, α, R)
31:             return ⋀_i θ_i
32:         otherwise
33:             for t -α→ Δ_i do
34:                 Assume {t_k} with t_k ∈ ⌈Δ_i⌉ and {u_j} with u =α̂⇒ Γ ∧ u_j ∈ ⌈Γ⌉
35:                 R := {(t_k, u_j) | Close(t_k, u_j, Visited) = tt}
36:                 θ_i := LP(Δ_i, u, α, R)
37:             return ⋀_i θ_i
38:
39: function Close(t, u, Visited) =
40:     if (t, u) ∈ Bisim then return tt
41:     else if (t, u) ∈ NonBisim then return ff
42:     else if (t, u) ∈ Visited then
43:         Assumed := Assumed ∪ {(t, u)}
44:         return tt
45:     else return Match(t, u, Visited)


Now let us prove the termination and correctness of the algorithm. 


Theorem 1 (Termination). Given two configurations t and u, the function 
GroundBisimulation(t,u) always terminates. 


Proof. The algorithm starts with two empty sets NonBisim and Bisim. The
next action to perform is detected in Match. Then it invokes the function
MatchAction to find the next new pair of configurations and recursively calls the
function Match to check them. Once a state pair is found to be non-bisimilar in
function Match, it is added to NonBisim. Meanwhile, if it is also contained
in the set Assumed, the algorithm restarts a new execution of Bisim.
Let $k$ index the executions of Bisim, writing $Bisim_k$ for the $k$th execution and
$NonBisim_k$ for the set NonBisim at the end of $Bisim_k$. It is easy to show by
induction that $NonBisim_k \subseteq NonBisim_{k+1}$ for any $k > 0$. Since the system under
consideration is finite-state, there always exists some $n$ such that $NonBisim_n$ is the
largest set of non-bisimilar state pairs and $Bisim_n$ is the last execution of Bisim.

After the execution of $Bisim_n$, no more exceptions will be raised. Each time
Match is executed with t and u as its parameters, we add (t, u) to Visited.
The quantum variables and the configurations of the quantum registers for t and
u are compared. When no more state pairs are added to Visited, the function
Match will not be invoked again and the whole algorithm will terminate.


Theorem 2 (Correctness). Given two configurations t and u from two pLTSs, 
Bisim(t,u) returns true if and only if they are ground bisimilar. 


Proof. Let $Bisim_n$ be the last execution of Bisim. Let $NonBisim_n$ and $Bisim_n$
be the values of the two sets NonBisim and Bisim, respectively, recording the
checked state pairs at the end of $Bisim_n$. By inspecting Match, we know that
$NonBisim_n \cap Bisim_n = \emptyset$.

Let us analyse the result returned by $Bisim_n$, which is the output of the
function call Match(t, u, Visited). If the result is false then one of the conjuncts
in $b_{res}$ is invalid, which means that one of the three cases discussed in the
beginning of Section 3 occurs, thus t and u are indeed non-bisimilar. If the
result is true then $Bisim_n = Visited_n \backslash NonBisim_n$. For each pair
$(t, u) \in Bisim_n$, all the conjuncts in $b_{res}$ must be true. Both t and u must
have the same set of free quantum variables and the same density operators. In
addition, they have matching transitions. That is, for any action $\alpha$, if $t \xrightarrow{\alpha} \Delta$
then there exists some distribution $\Theta$ such that $u \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ and $\Delta\ \mathcal{R}^\circ\ \Theta$.
This is true because (i) the relation $\mathcal{R}$ in function MatchAction is correctly
constructed, and (ii) the lifted relation $\mathcal{R}^\circ$ exists. Below we argue for (i); the
existence of the lifting operation in (ii) relies on the validity of the predicate LP
whose correctness is established by Theorem 9 in [28].

The algorithm adds a pair to Assumed if the pair to be checked has already
been visited and passed the bisimulation checking conditions. It implies
that $Assumed_n \subseteq Visited_n$. Furthermore, as there is no wrong assumption
detected after the execution of $Bisim_n$, we have $Assumed_n \subseteq Bisim_n$, which
implies that $Bisim_n = Assumed_n \cup Bisim_n$. So $Bisim_n$ constitutes a bisimulation
relation containing the initial state pair (t, u).


Before concluding this section, we analyse the time complexity of the algo- 
rithm. 


Theorem 3 (Complexity). Let the number of configurations reachable from t
and u be n. The time complexity of function Bisim(t, u) is polynomial in n.

Proof. The number of state pairs is at most $n^2$. The number of state pairs
examined in the $k$th execution of Bisim is at most $O(n^2 - k)$. Therefore, the
total number of state pairs examined is at most $O(n^2 + (n^2 - 1) + \cdots + 1) = O(n^4)$.
Note that each state has finitely many outgoing transitions. Given a transition,
to check if there exists a weak matching transition, we call the function LP at
most once; the construction of a flow network and solving the linear programming
problem are both polynomial in n if we use the algorithm in [28]. Consequently,
the whole algorithm is also polynomial in n.
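
For completeness, the bound on the total number of examined pairs is the usual arithmetic series with $n^2$ terms; the summation index $j$ is introduced here only for this calculation:

```latex
\[
  n^2 + (n^2 - 1) + \cdots + 1
  \;=\; \sum_{j=1}^{n^2} j
  \;=\; \frac{n^2\,(n^2 + 1)}{2}
  \;=\; O(n^4).
\]
```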


For the strong version of ground bisimulation, we are only concerned with 
the matching of strong transitions. Therefore, Algorithm 1 can be simplified and 
there is no need of the predicate LP in the function MatchAction. 
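
The control structure of the simplified strong checker can be sketched as follows. This is not the tool's code: it is a minimal illustration on ordinary (non-probabilistic) labelled transition systems, i.e. assuming every target distribution is a point distribution and ignoring the quantum-variable and density-operator checks. The LTS encoding (a dictionary from states to sets of `(action, successor)` pairs) and all names are hypothetical.

```python
class WrongAssumption(Exception):
    """A pair that was assumed bisimilar turned out not to be."""

def strong_bisimilar(lts, t, u):
    non_bisim = set()                    # kept across restarts, like NonBisim

    def attempt():
        bisim, assumed = set(), set()    # renewed on every restart

        def close(p, q, visited):
            if (p, q) in bisim:
                return True
            if (p, q) in non_bisim:
                return False
            if (p, q) in visited:        # loop: assume bisimilar for now
                assumed.add((p, q))
                return True
            return match(p, q, visited)

        def match(p, q, visited):
            visited = visited | {(p, q)}
            # every move of p must be matched by an equally labelled move of q
            forth = all(
                any(b == a and close(p2, q2, visited) for (b, q2) in lts[q])
                for (a, p2) in lts[p])
            # and symmetrically for the moves of q
            back = forth and all(
                any(b == a and close(p2, q2, visited) for (b, p2) in lts[p])
                for (a, q2) in lts[q])
            if forth and back:
                bisim.add((p, q))
                return True
            non_bisim.add((p, q))
            if (p, q) in assumed:        # an earlier answer relied on this pair
                raise WrongAssumption
            return False

        return close(t, u, frozenset())

    while True:
        try:
            return attempt()
        except WrongAssumption:
            continue                     # restart with the enlarged non_bisim

lts = {
    "a0": {("a", "a1")}, "a1": {("b", "a0")},                 # (a.b)^omega
    "b0": {("a", "b1")}, "b1": {("b", "b2")},
    "b2": {("a", "b3")}, "b3": {("b", "b0")},                 # unrolled twice
    "p0": {("a", "p0"), ("a", "p1")}, "p1": set(),            # may deadlock
    "q0": {("a", "q0")},                                      # a^omega
}
```

With this encoding, `strong_bisimilar(lts, "a0", "b0")` succeeds by assuming the looping pair, while the pair `("p0", "q0")` is rejected, possibly after a restart triggered by a wrong assumption, depending on the order in which transitions are explored.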


4 Implementation and Experiments 


In this section, we report on an implementation of our approach and provide the 
experimental results of verifying several quantum communication protocols. 


[Figure 3: workflow diagram. The implementation and the specification, each with its variable initialisation and operator definitions, are passed through the parser to obtain ASTs; the pLTS generation module turns the ASTs into pLTSs; the strong or weak bisimulation checking module then outputs the bisimilar configuration pairs.]


Fig. 3. Verification workflow.


4.1 Implementation 


We have implemented both strong and weak ground bisimulation checkers in 
Python 3.7. The workflow of our tool is sketched in Figure 3. The tool consists 
of a pLTS generation module and two bisimulation checking modules, devoted 
to modeling and verification, respectively. The input of this tool is a specifica- 
tion and an implementation of a quantum protocol, both described as qCCS 
processes, the definition of user-defined operators, as well as an initialisation of 
classical and quantum variables. Unlike classical variables, the initialisation of
all quantum variables, deemed as a quantum register, is accomplished at the
same time so as to allow for superposition states. The final output of the tool is a
result indicating whether the specification and the implementation are bisimilar 
under the same initial states. The algorithm also stores the bisimilar state pairs 
and non-bisimilar state pairs in two tables. 

The pLTS generation module acts as a preprocessing unit before the verification
task. It first translates the input qCCS processes into two abstract syntax
trees (ASTs) by a parser. Then the ASTs are transformed into two pLTSs according
to the operational semantics given in Figure 2, using the user-defined
operators and the initial values of variables. The weak bisimulation checking
module implements the weak ground bisimilarity checking algorithm we defined
in the last section. It checks whether the initial states of the two generated pLTSs
are weakly bisimilar.

The tool is available in [24], where we also provide all the examples for the 
experiments to be discussed in Section 4.3. 


4.2 BB84 Quantum Key Distribution Protocol 


To illustrate the use of our tool, we formalise the BB84 quantum key distribution 
protocol. Our formalisation follows [11], where a manual analysis of the protocol 
is provided. Now we perform automatic verification via the ground bisimulation 
checker. 

The BB84 protocol provides a provably secure way to create a private key 
between two partners with a classical authenticated channel and a quantum inse- 
cure channel between them. The protocol does not make use of entangled states. 
It ensures its security through a basic property of quantum mechanics: if the
states to be distinguished are not orthogonal, such as $|0\rangle$ and $|+\rangle$, then information
gain about a quantum state is only possible at the expense of changing the
state. Let the sender and the receiver be Alice and Bob, respectively. The basic
BB84 protocol with a sequence of qubits $\tilde{q}$ of size $n$ goes as follows:


1. Alice randomly generates two sequences of bits $B_a$ and $K_a$ using her qubits
$\tilde{q}$. Note that $\tilde{q}$ here are auxiliary qubits which are not modified in this step.

2. Alice sets the state of $\tilde{q}$ such that the $i$th qubit of $\tilde{q}$ is $|x_y\rangle$, where $x$ and
$y$ are the $i$th bits of $B_a$ and $K_a$, respectively, and $|0_0\rangle = |0\rangle$, $|0_1\rangle = |1\rangle$,
$|1_0\rangle = |+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$ and $|1_1\rangle = |-\rangle = (|0\rangle - |1\rangle)/\sqrt{2}$.

3. Alice sends her qubits $\tilde{q}$ to Bob.

4. Bob randomly generates a sequence of bits $B_b$ using his qubits $\tilde{q}'$.

5. Bob measures the $i$th qubit of $\tilde{q}$ he received from Alice according to the
basis determined by the $i$th bit of $B_b$. The basis is $\{|0\rangle, |1\rangle\}$ if
the bit is 0 and $\{|+\rangle, |-\rangle\}$ if it is 1.

6. Bob sends his choice of measurements $B_b$ to Alice, and after receiving the
information, Alice sends her $B_a$ to Bob.

7. Alice and Bob match the two sequences of bits $B_a$ and $B_b$ to determine at which
positions the bits are equal. If the bits match, they keep the corresponding
bits of $K_a$ and $K_b$. Otherwise, they discard them.


After the execution of the basic BB84 protocol, the remaining bits of $K_a$ and
$K_b$ should be the same, provided that the communication channels are perfect
and there is no eavesdropper.
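
The sifting behaviour described by the steps above can be imitated classically. The sketch below is an assumption-laden illustration, not the qCCS model: it replaces the quantum measurement in step 5 by its outcome statistics (Bob recovers Alice's key bit when the bases agree and obtains a uniformly random bit otherwise), handles one qubit per round, and uses `None` in place of the empty value $\epsilon$.

```python
import random

def cmp(x, y, z):
    """Return x if y and z match, otherwise None (standing in for epsilon)."""
    return x if y == z else None

def bb84_round(rng):
    b_a = rng.randint(0, 1)      # step 1: Alice's basis bit
    k_a = rng.randint(0, 1)      # step 1: Alice's key bit
    b_b = rng.randint(0, 1)      # step 4: Bob's basis bit
    # steps 2, 3 and 5, modelled by outcome statistics only
    k_b = k_a if b_a == b_b else rng.randint(0, 1)
    # steps 6 and 7: exchange bases and keep the bit only if they match
    return cmp(k_a, b_a, b_b), cmp(k_b, b_a, b_b)

rng = random.Random(0)           # fixed seed for reproducibility
rounds = [bb84_round(rng) for _ in range(1000)]
sifted = [(ka, kb) for ka, kb in rounds if ka is not None]
```

In every kept round the two key bits agree, which is the property the specification in the next paragraphs captures with a single shared key $K_a$.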


Implementation. For simplicity, we assume that the sequence $\tilde{q}$ consists of only
one qubit. This is enough to reflect the essence of the protocol. The other qubits
used below are auxiliary qubits for the operation Ran.


$Alice \stackrel{def}{=} Ran[q_1; B_a].Ran[q_1; K_a].Set_{K_a}[q_1].H_{B_a}[q_1].A2B!q_1.$
$\qquad\qquad b2a?B_b.a2b!B_a.key_a!cmp(K_a, B_a, B_b).\mathbf{nil};$


$Bob \stackrel{def}{=} A2B?q_1.Ran[q_2; B_b].M_{B_b}[q_1; K_b].b2a!B_b.$
$\qquad\qquad a2b?B_a.key_b!cmp(K_b, B_a, B_b).\mathbf{nil};$
$BB84 \stackrel{def}{=} (Alice \parallel Bob) \backslash \{a2b, b2a, A2B\}$


where there are several special operations:


– $Ran[q; x] = Set_+[q].M_{0,1}[q; x].Set_0[q]$, where $Set_+$ (resp. $Set_0$) is the
operation which sets the qubit it is applied to to $|+\rangle$ (resp. $|0\rangle$), and $M_{0,1}[q; x]$ is the
quantum measurement on $q$ according to the basis $\{|0\rangle, |1\rangle\}$, which stores the
result into $x$.

– $Set_K[q]$ sets the qubit $q$ to the state $|K\rangle$.

– $H_B[q]$ applies $H$ or does nothing on the qubit $q$ depending on whether the
value of $B$ is 1 or 0.

– $M_B[q; K]$ is the quantum measurement on $q$ according to the basis $\{|+\rangle, |-\rangle\}$
or $\{|0\rangle, |1\rangle\}$ depending on whether the value of $B$ is 1 or 0.

– $cmp(x, y, z)$ returns $x$ if $y$ and $z$ match, and $\epsilon$, meaning it is empty, if they
do not match.


Specification. The specification can be defined as follows using the same operations:


$BB84_{spec} \stackrel{def}{=} Ran[q_1; B_a].Ran[q_1; K_a].Ran[q_2; B_b].$
$\qquad\qquad (key_a!cmp(K_a, B_a, B_b).\mathbf{nil} \parallel key_b!cmp(K_a, B_a, B_b).\mathbf{nil})$.


Input. For the implementation of BB84, we need to declare the following variables
and operators in the input attached to it.


– The classical bits are named $B_a$, $K_a$ for Alice and $B_b$, $K_b$ for Bob.
– The qubits are declared together as a vector $|q_1, q_2\rangle$. The vector always needs
an initial value. We can set it to be $|00\rangle$ in this example.


When modelling the protocol, we use several operators. They should be defined
and their definitions are part of the input.


– The operator Ran involves two operators $Set_+$, $Set_0$ and a measurement
$M_{0,1}$ measuring the qubit according to the basis $\{|0\rangle, |1\rangle\}$.

– $Set_K$ needs $Set_0$ and $Set_1$.

– $H_B$ requires the Hadamard gate $H$.

– $M_B$ uses the measurement $M_{+,-}$ which measures the qubit according to the
basis $\{|+\rangle, |-\rangle\}$.


The function cmp is treated as an in-built function, so there is no need to define 
it in the input. 

For the specification $BB84_{spec}$, we only declare the classical bits $B_a$, $B_b$, $K_a$,
qubits $q_1$, $q_2$ and the operator Ran. The variables and operators declared here
are the same as those in the input of the implementation.

Output. Taking the input discussed above, the tool first generates two pLTSs,
with over 150 states for the implementation and 80 states for the specification,
and then runs the ground bisimulation checking algorithm. As we can see from
the fifth row in Table 1, our tool confirms that $(BB84, \rho_0) \approx (BB84_{spec}, \rho_0)$,
where $\rho_0$ denotes the initial state of the quantum register; thus the implementation
is faithful to the specification. In the output of the tool, there is an enumeration
of 1084 pairs of non-bisimilar states and 3216 pairs of bisimilar states.
The pLTSs and the state pairs can be found in [24].


4.3 Experimental Results 


We conducted experiments on several quantum communication protocols with 
a few different input variables. Table 1 provides a summary of our experimental 
results obtained on a macOS machine with an Intel Core i7 2.5 GHz processor 
and 16GB of RAM. 


Weak ground bisimulation

Program | Variables | Bisim | Impl | Spec | N | B | ms
Super-dense coding | $q_1q_2 = |00\rangle$, $x = 1$ | Yes | 16 | 5 | 9 | 20 | 259
Super-dense coding | $q_1q_2 = |00\rangle$, $x = 5$ | No | 6 | … | - | - | 2
Super-dense coding (modified) | $q_1q_2 = |00\rangle$, $x = 5$ | … | 8 | … | … | … | …
Teleportation | $q_1q_2q_3 = |100\rangle$ | Yes | 34 | … | … | … | …
Teleportation | $q_1q_2q_3 = \ldots|000\rangle + \ldots|100\rangle$ | Yes | 34 | … | … | … | …
Teleportation | $q_1q_2q_3 = \ldots|000\rangle + \ldots|100\rangle$ | Yes | 34 | … | … | … | 239
Secret Sharing | $q_1q_2q_3q_4 = |1000\rangle$ | Yes | 103 | … | 65 | 65 | 1339
Secret Sharing | $q_1q_2q_3q_4 = \ldots|0000\rangle + \ldots|1000\rangle$ | Yes | 103 | … | 65 | 65 | 1252
Secret Sharing | $q_1q_2q_3q_4 = \ldots|0000\rangle + \ldots|1000\rangle$ | Yes | 103 | … | 65 | 65 | 1187
BB84 | $q_1q_2 = |00\rangle$ | Yes | 152 | 80 | 1084 | 3216 | 130163
BB84 (with eavesdropper) | $q_1q_2q_3 = |000\rangle$ | Yes | 1180 | 352 | 121072 | 75392 | 55728587
B92 | $q_1q_2 = |00\rangle$ | Yes | 64 | 80 | 466 | 1284 | 34522
E91 | $q_1q_2q_3q_4 = |0000\rangle$ | Yes | 124 | 80 | 964 | 2676 | 113840


Table 1. Experimental results. The columns headed by Impl and Spec show the
numbers of nodes contained in the generated pLTSs of the implementations and
specifications, respectively. Column N shows the sizes of the sets of non-bisimilar state pairs
and Column B shows the sizes of the sets of bisimilar state pairs. Column ms shows
the time cost of the verification in milliseconds.


In each case, we report the final outcome (whether an implementation is
ground bisimilar to its specification), the number of nodes in the two pLTSs, the
numbers of non-bisimilar and bisimilar state pairs in NonBisim and Bisim,
respectively, as well as the verification time of our ground bisimulation checking
algorithm. The time cost excludes the pLTS generation, which takes
around one second in all the examples.

Besides the protocol discussed in Section 4.2, we also verify other ones that
make use of entangled qubits, such as the teleportation and the quantum secret
sharing protocols. For quantum key distribution protocols, we conduct experiments
on BB84, B92 and E91.

Not all the cases in Table 1 give the size of the set NonBisim of non-bisimilar
state pairs, as the bisimulation checking algorithm may immediately terminate
once a negative verification result is obtained, i.e. the two initial states are not
bisimilar.


Data Availability Statement 
The datasets generated and/or analyzed during the current study are available 
in the figshare repository: https://doi.org/10.6084/m9.figshare.11874942.v1. 


5 Conclusion and Future Work 


We have presented an on-the-fly algorithm to check ground bisimulation for 
quantum processes in qCCS, and a simpler algorithm for strong ground bisim- 
ulation. Based on the algorithms, we have developed a tool to verify quantum 
communication protocols modelled as qCCS processes. We have carried out ex- 
periments on several non-trivial quantum communication protocols from super- 
dense coding to key distribution and found the tool helpful. 

As to future work, several interesting problems remain to be addressed. For
example, a limitation of the current work is that it compares quantum processes
only for predetermined states of quantum registers. Indeed, there are occasions where
one would expect two processes to be equivalent for arbitrary initial states.
It is infeasible to enumerate all those states. Then the symbolic bisimulations
proposed in [11] will be useful. We plan to implement the algorithm
for symbolic ground bisimulation, and then tackle the more challenging symbolic
open bisimulation, both proposed in that work. Another problem occurs in the
experiment of Section 4.2. The example tested one qubit instead of a sequence of
qubits because more qubits lead to a drastic growth of the running time, which
shows a limitation of the current approach of explicitly representing state spaces.


Appendix 


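Algorithm 1 needs to check the condition that if $t \xrightarrow{\alpha} \Delta$ then there exists some
$\Theta$ such that $u \stackrel{\hat{\alpha}}{\Longrightarrow} \Theta$ and $\Delta\ \mathcal{R}^\circ\ \Theta$. We use as a subroutine a procedure of [28] to
reduce the problem to a network flow problem that can be solved in polynomial
time.

Technically, we construct a network graph $G(\Delta, u, \alpha, \mathcal{R}) = (V, E)$ defined as
follows. Let $S$ be the set of reachable states, and $\mathcal{R}$ be a binary relation on the
states.

Let $\triangle$ and $\triangledown$ be two vertices that represent the source and the sink of the
network, respectively. For each visible action $\alpha$, the set of vertices $V$ is given
below

$V = \{\triangle, \triangledown\} \cup S \cup S^{tr} \cup S_\alpha \cup S_\alpha^{tr} \cup S_\perp \cup S_{\mathcal{R}}$

where
$S^{tr} = \{v^{tr} \mid tr = v \xrightarrow{\beta} \Gamma,\ \beta \in \{\alpha, \tau\}\}$;
$S_\alpha = \{v_\alpha \mid v \in S\}$;
$S_\alpha^{tr} = \{v_\alpha^{tr} \mid v^{tr} \in S^{tr}\}$;
$S_\perp = \{v_\perp \mid v \in S\}$;
$S_{\mathcal{R}} = \{u_{\mathcal{R}} \mid u \in S\}$;

and the set of edges $E$ is

$E = \{(\triangle, u)\} \cup L_1 \cup L_\alpha \cup L_2 \cup L_\perp \cup L_{\mathcal{R}}$

where
$L_1 = \{(v, v^{tr}), (v^{tr}, v') \mid tr = v \xrightarrow{\tau} \Gamma,\ v' \in \lceil\Gamma\rceil\}$;
$L_\alpha = \{(v, v_\alpha^{tr}), (v_\alpha^{tr}, v'_\alpha) \mid tr = v \xrightarrow{\alpha} \Gamma,\ v' \in \lceil\Gamma\rceil\}$;
$L_2 = \{(v_\alpha, v_\alpha^{tr}), (v_\alpha^{tr}, v'_\alpha) \mid tr = v \xrightarrow{\tau} \Gamma,\ v' \in \lceil\Gamma\rceil\}$;
$L_\perp = \{(u_\alpha, u_\perp) \mid u \in S\}$;
$L_{\mathcal{R}} = \{(s_\perp, s'_{\mathcal{R}}), (s'_{\mathcal{R}}, \triangledown) \mid (s', s) \in \mathcal{R}\}$.

For the invisible action $\tau$, the definition is similar: $V = \{\triangle, \triangledown\} \cup S \cup S^{tr} \cup S_\perp \cup S_{\mathcal{R}}$
and $E = \{(\triangle, u)\} \cup L_1 \cup L_\perp \cup L_{\mathcal{R}}$ where $L_\perp = \{(s, s_\perp) \mid s \in S\}$.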
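If $\alpha$ is a visible action, we consider the following linear programming problem
associated to $G(\Delta, u, \alpha, \mathcal{R})$:


$\max \sum_{(s,v) \in E} -f_{s,v}$


subject to

$f_{s,v} \geq 0$ for each $(s, v) \in E$
$f_{\triangle,u} = 1$
$f_{v_{\mathcal{R}},\triangledown} = \Delta(v)$ for each $v \in S$
$\sum_{(s,v) \in E} f_{s,v} - \sum_{(v,w) \in E} f_{v,w} = 0$ for each $v \in V \setminus \{\triangle, \triangledown\}$
$f_{v^{tr},v'} - \Gamma(v') \cdot f_{v,v^{tr}} = 0$ for each $tr = v \xrightarrow{\tau} \Gamma$ and $v' \in \lceil\Gamma\rceil$
$f_{v_\alpha^{tr},v'_\alpha} - \Gamma(v') \cdot f_{v,v_\alpha^{tr}} = 0$ for each $tr = v \xrightarrow{\alpha} \Gamma$ and $v' \in \lceil\Gamma\rceil$
$f_{v_\alpha^{tr},v'_\alpha} - \Gamma(v') \cdot f_{v_\alpha,v_\alpha^{tr}} = 0$ for each $tr = v \xrightarrow{\tau} \Gamma$ and $v' \in \lceil\Gamma\rceil$


Note that the fourth constraint is referred to as the flow-conservation constraint.
The last three constraints link the source state and the result distribution.

For the invisible action $\tau$, the linear programming problem associated to the
network $G(\Delta, u, \tau, \mathcal{R})$ is the same as above except that the last two constraints
are dropped.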

We denote by $LP(\Delta, u, \alpha, \mathcal{R})$ the predicate that is true if and only if the
linear programming problem above has a solution.


Abstract. We present an algorithm to decide the equivalence of context-free
session types, practical to the point of being incorporated in a compiler.
We prove its soundness and completeness. We further evaluate its
behaviour in practice. In the process, we introduce an algorithm to decide
the bisimilarity of simple grammars.


Keywords: Types, Type equivalence, Bisimulation, Algorithm 


1 Introduction 


Session types enhance the expressivity of traditional types for programming lan- 
guages by allowing the description of structured communication on heteroge- 
neously typed channels [14,15,24]. Traditional session types are regular in the 
sense that the sequences of communication actions admitted by a type are in 
the union of a regular language (for finite executions) and an ω-regular language
(for infinite executions). Introduced by Thiemann and Vasconcelos, context-free 
session types liberate traditional session types from the shackles of tail recursion, 
allowing, for example, the safe serialization of arbitrary recursive datatypes [26]. 
Session types are often used to discipline interactions in concurrent programs. 
When associated to (bidirectional, heterogeneous) channels, session types de- 
scribe the permitted patterns of interaction. For example, a type of the form 


rec x. ⊕{Leaf: Skip, Node: !Int;x;x}


may describe one end of a communication channel. A process holding such a 
channel end must first select between choices Leaf and Node. If Leaf is chosen, then 
type Skip forwards the interaction to the continuation, if any. If no continuation 
is present, then interaction is over. Otherwise, the process must send an integer 
(!Int) followed by two trees, as witnessed by the recursive calls occurring after 
the choice of Node. A concurrent process holding the other end of the channel 
interacts via a dual type: 


rec y. &{Leaf: Skip, Node: ?Int;y;y}


In this case the process must be ready to offer both choices, Leaf and Node. For 
the latter option, the process must further receive an integer (?Int), followed by 
two trees. 

Regular languages cannot capture such behaviour. The best one can do with
regular session types (and without resorting to channel passing) is to use a
regular type that allows transmitting trees, as well as many other non-tree-like
structures. The correct behaviour of processes interacting on such a channel
would need to be checked at runtime [2,26].

© The Author(s) 2020


While the algorithmic aspects of type equivalence for regular session types are well
known (Gay and Hole propose an algorithm to decide subtyping [9], from which
type equivalence can be derived), the same does not apply to context-free session
types. Thiemann and Vasconcelos [26] show that the equivalence of context-free
session types is decidable, by reducing the problem to the verification of
bisimulation for Basic Process Algebra (BPA) which, in turn, was proved decidable
by Christensen, Hüttel, and Stirling [6]. Even if the equivalence problem for
context-free session types is known to be decidable, no algorithm has been
proposed. Padovani [20] introduces a language with context-free session types that
avoids the problem of checking the equivalence of types by requiring annotations 
in the source code. Annotations result in the structural alignment between code 
and types. This alignment—enforced by an explicit resumption process operator 
that breaks sequential composition in types—sidesteps the problem central to 
this paper: that of checking type equivalence. Furthermore, there are some basic 
equivalences on types that the compiler is not able to identify [20]. 


After the breakthrough by Christensen, Hüttel, and Stirling (a result that
provides no immediate practical algorithm), the problem of deciding the
equivalence of BPA terms has been addressed by several researchers [4,6,8,18]. Most
of these works provide no practical algorithm that can be readily used, except
the one by Czerwiński and Lasota, where a polynomial-time algorithm is presented
that decides the bisimilarity of normed context-free processes in $O(n^5)$ [8].
However, context-free session types are not necessarily normed, which precludes
resorting to this algorithm, or using the original result by Baeten, Bergstra,
and Klop [3], as well as improvements by Hirshfeld, Jerrum, and Moller [12,13].
Moreover, the complexity estimates for deciding bisimilarity of BPA processes are
not promising. Kiefer provided an EXPTIME lower bound for BPA bisimilarity
by proving this problem is EXPTIME-hard [19], whereas Jančar has provided a
double exponential upper bound for this problem [17].

The decidability of the equivalence problem for deterministic pushdown automata
(DPDA) has also been the subject of much study [16,22,23]. Several techniques have
been proposed to solve the problem, but no immediate practical algorithm was
available until Henry and Sénizergues provided an algorithm for this problem [10].
Its poor performance, however, precludes its incorporation in a compiler.
Furthermore, the algorithm Henry and Sénizergues propose handles the problem of
language equivalence rather than the problem of deciding bisimilarity of DPDAs.

Our algorithm to decide the equivalence of context-free session types also
allows deciding the bisimilarity of simple grammars (i.e., deterministic
grammars in Greibach Normal Form). It proceeds in three stages. The first stage
builds a context-free grammar in Greibach Normal Form (GNF), in fact a
simple grammar, from two context-free session types in a way that bisimulation
is preserved. A basic result from Baeten, Bergstra, and Klop states that any
guarded BPA system can be transformed into Greibach Normal Form
while preserving bisimulation equivalence, but unfortunately no procedure is
presented [3]. The second stage prunes the grammar by removing unreachable
symbols in unnormed sequences of non-terminal symbols. This stage builds on
the result of Christensen, Hüttel, and Stirling [6]. The third stage constructs
an expansion tree, by alternating between expansion and simplification steps.
This last stage uses expansion operations proposed by Jančar, Moller, and
Hirshfeld [11,18], and simplification rules proposed by Caucal, Christensen, Hüttel,
Stirling, Jančar, and Moller [5,6,18]. The finite representation of bisimulations
of BPA transition graphs [5,6] is paramount for our results of soundness and
completeness.

The branching nature of the expansion tree confers an (at least) exponential
complexity to the algorithm. However, our experiments with a concrete
implementation, both as a stand-alone tool and incorporated in a compiler [2],
are promising. We propose heuristics that decrease the execution time by 89%
and reduce the number of timeouts by 95% (see Section 5).

We present an algorithm to decide the equivalence of context-free session 
types, practical to the point of being readily included in any compiler, an exercise 
that we conducted in parallel [2]. The main contributions of this work are: 


— The proposal and implementation of an algorithm to decide type equivalence 
of context-free session types; 

— A proof of its soundness and completeness against the declarative definition; 

— The proposal and implementation of an algorithm to decide the bisimilarity 
of simple grammars; and 

— The empirical study of the runtime behaviour of the implementation. 


The rest of the paper is organized as follows: an introduction to context-free 
session types can be found in Section 2, the algorithm in Section 3, the main 
results in Section 4, evaluation in Section 5, and conclusions in Section 6. 


2 Context-free session types 


This section briefly introduces context-free session types, based on the work of
Thiemann and Vasconcelos [26]. The types we consider build upon a denumerable
set of variables and a set of choice labels. Metavariables $X, Y, Z$ range over
variables and $\ell$ over labels. We assume given a set of base types denoted by $B$.
The syntax of types is given by the grammar below.


$$S, T ::= \mathsf{skip} \mid \sharp B \mid \star\{\ell_i : T_i\}_{i \in I} \mid S;T \mid \mu X.T \mid X
\qquad \sharp ::= {?} \mid {!} \qquad \star ::= \oplus \mid \&$$


In type $\mu X.T$, variable $X$ is bound in the subterm $T$. The sets of bound and
free variables in a given type are defined accordingly. Notation $[T/X]S$ denotes
the result of substituting $T$ for the (free) occurrences of $X$ in $S$.

Judgement $S\,\checkmark$ characterizes terminated types: context-free session types
that exhibit no further action [1].
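Terminated predicate: $T\,\checkmark$


$$\frac{}{\mathsf{skip}\,\checkmark} \qquad
\frac{}{X\,\checkmark} \qquad
\frac{S\,\checkmark \quad T\,\checkmark}{S;T\,\checkmark} \qquad
\frac{T\,\checkmark}{\mu X.T\,\checkmark}$$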


Notice that all types of the form $\mu X_1.\,\ldots\,\mu X_n.X_n$, for $n > 0$, are terminated.

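We are not interested in all types generated by the above grammar. If $\Delta$ is a
list of pairwise distinct variables, then judgement $\Delta \vdash T$ characterises the
types of interest: the well-formed types.


Type formation system: $\Delta \vdash T$


$$\frac{}{\Delta \vdash \mathsf{skip}} \qquad
\frac{}{\Delta \vdash \sharp B} \qquad
\frac{X \in \Delta}{\Delta \vdash X} \qquad
\frac{\Delta \vdash S \quad \Delta \vdash T}{\Delta \vdash S;T}$$

$$\frac{\Delta \vdash T_i \quad (\forall i \in I)}{\Delta \vdash \star\{\ell_i : T_i\}_{i \in I}} \qquad
\frac{\neg\, T\,\checkmark \quad \Delta, X \vdash T}{\Delta \vdash \mu X.T}$$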


Terminated types have a simple characterisation (types comprising $\mathsf{skip}$,
$\mu$ and semicolon), which justifies the inclusion of $\neg\,T\,\checkmark$ in the rules
for type formation (Thiemann and Vasconcelos [26] introduce a contractive judgement
for the effect). Type formation serves two main purposes: ensuring that all variables
introduced by $\mu$-types are pairwise distinct and that types underneath a $\mu$ are
not terminated. This can be clearly seen in the formation rule for $\mu$-types, where
notation $\Delta, X$ is understood as requiring $X \notin \Delta$. In the sequel we assume
that all types are such that $\vdash T$ and denote by $\mathcal{T}$ the set of such types.

The set of actions is generated by the following grammar.


$$a ::= \sharp B \mid \star\ell$$


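The labelled transition system (LTS) for context-free session types is given
by $\mathcal{T}$ as the set of states, the set of actions, and the transition relation
$S \xrightarrow{a}_{\mathcal{T}} T$ defined by the rules below.


Labelled transition system: $S \xrightarrow{a}_{\mathcal{T}} T$


$$\frac{}{\sharp B \xrightarrow{\sharp B}_{\mathcal{T}} \mathsf{skip}} \qquad
\frac{j \in I}{\star\{\ell_i : T_i\}_{i \in I} \xrightarrow{\star\ell_j}_{\mathcal{T}} T_j}$$

$$\frac{S \xrightarrow{a}_{\mathcal{T}} S'}{S;T \xrightarrow{a}_{\mathcal{T}} S';T} \qquad
\frac{S\,\checkmark \quad T \xrightarrow{a}_{\mathcal{T}} T'}{S;T \xrightarrow{a}_{\mathcal{T}} T'} \qquad
\frac{[\mu X.S/X]S \xrightarrow{a}_{\mathcal{T}} T}{\mu X.S \xrightarrow{a}_{\mathcal{T}} T}$$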

Type bisimulation is defined in the usual way from the labelled transition
system [21]. We say that a type relation $R$ is a bisimulation if, whenever $S\,R\,T$,
for all $a$ we have:


— for each $S'$ with $S \xrightarrow{a}_{\mathcal{T}} S'$, there is $T'$ such that $T \xrightarrow{a}_{\mathcal{T}} T'$ and $S'\,R\,T'$, and
— for each $T'$ with $T \xrightarrow{a}_{\mathcal{T}} T'$, there is $S'$ such that $S \xrightarrow{a}_{\mathcal{T}} S'$ and $S'\,R\,T'$.


We say that two types are bisimilar, written $S \sim_{\mathcal{T}} T$, if there is a
bisimulation $R$ with $S\,R\,T$.
3 An algorithm to decide type bisimilarity


This section presents an algorithm to decide whether two types are in a type 
bisimulation. In the process we also provide an algorithm to decide the bisimi- 
larity of simple context-free languages. The algorithm comprises three stages: 


1. Translate the two types to a simple grammar, 

2. Prune unreachable symbols, and 

3. Explore an expansion tree, alternating between simplification and expansion
operations, until either finding an empty node, in which case it decides positively,
or failing to expand all nodes, in which case it decides negatively.


Translating types to grammars. Type variables $X$ are the non-terminal symbols
and LTS labels $a$ are the terminal symbols. Sequences of type variables $\vec{X}$ are
called words; $\varepsilon$ denotes the empty word. A context-free grammar in Greibach
Normal Form is a pair $(\vec{X}, P)$ where $\vec{X}$ is the start word and $P$ a set of
productions of the form $Y \to a\vec{Z}$ (context-free session types do not require
productions of the form $Y \to \varepsilon$). Due to the deterministic nature of
context-free session types, the grammars we are interested in are simple: for each
non-terminal symbol $Y$ and terminal symbol $a$, there is at most one production of
the form $Y \to a\vec{Z}$.

Grammars in Greibach Normal Form naturally induce a labelled transition
system by taking words $\vec{X}$ for states, terminal symbols $a$ for actions, and
$\xrightarrow{a}_P$, defined as $X\vec{Y} \xrightarrow{a}_P \vec{Z}\vec{Y}$ when
$X \to a\vec{Z} \in P$, for the transition relation. The associated bisimilarity is
denoted by $\sim_P$.
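
As an illustration only, the transition relation induced by a simple grammar can be
sketched in Python. The representation is an assumption of this sketch and not the
paper's implementation: a word is a tuple of non-terminal names, and the productions
are a dictionary keyed by (non-terminal, terminal), so that the "at most one
production per non-terminal and terminal" condition holds by construction. The name
`transitions` is hypothetical.

```python
def transitions(word, productions):
    """Return {a: word'} for every transition word --a-->_P word'.

    `word` is a tuple of non-terminals; `productions` maps
    (non_terminal, terminal) to the tuple of non-terminals on the
    right-hand side of the production. Only the first symbol of a word
    can fire, and the empty word has no transitions.
    """
    if not word:
        return {}
    head, tail = word[0], word[1:]
    return {
        a: rhs + tail
        for (y, a), rhs in productions.items()
        if y == head
    }
```

For instance, with productions X → aXX and X → b, the word XY can move by a to XXY
and by b to Y.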

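The unravelling function on well-formed context-free session types, taken
from Thiemann and Vasconcelos [26], is defined as follows.


$$\mathsf{unr}(\mu X.T) = \mathsf{unr}([\mu X.T/X]T)$$

$$\mathsf{unr}(S;T) = \begin{cases}
\mathsf{unr}(T) & \text{if } \mathsf{unr}(S) = \mathsf{skip} \\
\mathsf{unr}(S);T & \text{if } \mathsf{unr}(S) \neq \mathsf{skip}
\end{cases}$$

$$\mathsf{unr}(T) = T \quad \text{in all other cases}$$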
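
The unravelling function can be sketched in Python as follows. This is a minimal
sketch under stated assumptions: types are encoded as tuples (`("skip",)`,
`("msg", polarity, base)`, `("choice", view, {label: type})`, `("seq", S, T)`,
`("rec", X, T)`, `("var", X)`), an encoding invented here for illustration, and
substitution is the naive one, which is adequate only when the substituted type is
closed. As in the text, termination is expected only on well-formed types.

```python
SKIP = ("skip",)


def subst(t, x, u):
    """Replace the free occurrences of variable x in t by u (u assumed closed)."""
    tag = t[0]
    if tag == "var":
        return u if t[1] == x else t
    if tag == "seq":
        return ("seq", subst(t[1], x, u), subst(t[2], x, u))
    if tag == "choice":
        return ("choice", t[1], {l: subst(b, x, u) for l, b in t[2].items()})
    if tag == "rec":
        # A binder for the same variable shadows x: stop substituting.
        return t if t[1] == x else ("rec", t[1], subst(t[2], x, u))
    return t  # skip and messages contain no variables


def unr(t):
    """Unravel leading recursion and drop leading skips in sequences."""
    if t[0] == "rec":
        return unr(subst(t[2], t[1], t))
    if t[0] == "seq":
        left = unr(t[1])
        return unr(t[2]) if left == SKIP else ("seq", left, t[2])
    return t
```

Applied to the tree type of the introduction, `unr` exposes the choice at the head
of the type, with the recursive type itself in place of each occurrence of the
bound variable.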


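The function terminates under the assumption that types are well formed.
Another function, word, builds a word from a type. In the process it updates
a global set $P$ of grammar productions. Word concatenation is denoted by
$\vec{X} \cdot \vec{Y}$.


$$\begin{aligned}
\mathsf{word}(\mathsf{skip}) &= \varepsilon \\
\mathsf{word}(S;T) &= \mathsf{word}(S) \cdot \mathsf{word}(T) \\
\mathsf{word}(\sharp B) &= Y, \text{ setting } P := P \cup \{Y \to \sharp B\} && (Y \text{ fresh}) \\
\mathsf{word}(\star\{\ell_i : T_i\}_{i \in I}) &= Y, \text{ setting } P := P \cup \{Y \to \star\ell_i \cdot \mathsf{word}(T_i) \mid i \in I\} && (Y \text{ fresh}) \\
\mathsf{word}(X) &= X \\
\mathsf{word}(\mu X.T) &= X
\end{aligned}$$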
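
A possible reading of function word, sketched in Python. The tuple encoding of
types, the encoding of terminals as strings such as `"&n"` or `"?int"`, the fresh
names `Z0, Z1, ...` and the helper name `make_word` are all assumptions of this
sketch; the set of productions is passed explicitly as a dictionary instead of
being a global.

```python
import itertools


def make_word(productions):
    """Return a `word` function that records new productions in `productions`.

    `productions` maps (non_terminal, terminal) to a tuple of non-terminals.
    """
    fresh = (f"Z{i}" for i in itertools.count())

    def word(t):
        tag = t[0]
        if tag == "skip":
            return ()
        if tag == "seq":
            return word(t[1]) + word(t[2])
        if tag == "msg":                  # ("msg", polarity, base_type)
            y = next(fresh)
            productions[(y, t[1] + t[2])] = ()
            return (y,)
        if tag == "choice":               # ("choice", view, {label: type})
            y = next(fresh)
            for label, branch in t[2].items():
                productions[(y, t[1] + label)] = word(branch)
            return (y,)
        if tag in ("var", "rec"):         # word(X) = word(mu X.T) = X
            return (t[1],)
        raise ValueError(f"unknown type constructor: {tag}")

    return word
```

Note that a message contributes a production with an empty right-hand word, and
that a terminated type such as skip yields the empty word, in line with Lemma 1.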


The following lemma relates terminated types to the result of a call to word. 


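Lemma 1. Let $\vdash T$. Then, $T\,\checkmark$ if and only if $\mathsf{word}(T) = \varepsilon$.

Proof. The direct implication follows by rule induction on predicate $\checkmark$:


— Case $\mathsf{skip}\,\checkmark$: $\mathsf{word}(\mathsf{skip}) = \varepsilon$.

— Case $X\,\checkmark$: if $T$ is $X$, then $\nvdash T$, which contradicts the hypothesis.

— Case $S;T\,\checkmark$: by induction hypothesis on the rule premises $S\,\checkmark$ and $T\,\checkmark$,
$\mathsf{word}(S) = \varepsilon$ and $\mathsf{word}(T) = \varepsilon$. Hence, $\mathsf{word}(S;T) = \varepsilon$.

— Case $\mu X.S$: the hypothesis $T\,\checkmark$ and the rule premises of hypothesis $\vdash T$ are
contradictory.


Conversely, if $\mathsf{word}(T) = \varepsilon$, using the rules of the definition of word that
produce the empty word:


— if $T$ is $\mathsf{skip}$, then we have $T\,\checkmark$.

— if $T$ is $U;V$, $\mathsf{word}(U) = \varepsilon$, and $\mathsf{word}(V) = \varepsilon$, then, by induction, we
have $U\,\checkmark$ and $V\,\checkmark$. Hence, $T\,\checkmark$.

— No other case in function word produces an empty word.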


To define the translation of context-free session types to simple grammars,
assume that $\{\mu X_1.T_1, \ldots, \mu X_n.T_n\}$ is the set of all $\mu$-subterms in a given
type $T$. Further assume that $i < j$ whenever $X_j \in \mathsf{free}(\mu X_i.T_i)$. That is, the
$\mu$-subterms are topologically sorted with respect to their lexical nesting, innermost
subterms first. Now we identify unrolled versions of the $\mu$-subterms.


$$T'_i = [\mu X_n.T_n/X_n] \cdots [\mu X_{i+1}.T_{i+1}/X_{i+1}]\, \mu X_i.T_i$$


Clearly each type $T'_i$ is closed (has no free variables). Notice that if $T$ is a
$\mu$-type, then $\mu X_n.T_n$ is $T$ itself.

Finally, given an initial set of productions Po, function grm translates a type 
T into a grammar composed of a start word and set of productions: 


$$\mathsf{grm}(T, P_0) = (\mathsf{word}(T), P_n)$$
where each $P_i$ is computed from $P_{i-1}$ by the following recurrence,
$$P_i = P'_i \cup \{X_i \to a_j \vec{Y}_j \vec{Z} \mid (Z \to a_j \vec{Y}_j) \in P'_i\}
\quad \text{where } (Z\vec{Z}, P'_i) = \mathsf{grm}(\mathsf{unr}(T'_i), P_{i-1})$$


Notice that $\mathsf{word}(\mathsf{unr}(T'_i))$ is a non-empty word because of Lemma 1 and the
fact that each $T'_i$ is non-terminated by hypothesis. The function grm terminates
on all inputs (because recursion is always on subterms) and adds a finite number 
of productions to the original set. Furthermore, because choices in session types 
do not contain duplicated labels, the function returns a simple grammar. 

To run grm on two well-formed types proceed as follows: rename the second
type so that bound variables do not overlap with those of the first; start with 
an empty set of productions; run the algorithm consecutively on the two types 
to obtain two initial words and a single set of productions. 
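Example 1. Consider the following pair of context-free session types.


$$S \triangleq (\mu X_1.\&\{n : X_1; X_1; {?\mathsf{int}},\ \ell : {?\mathsf{int}}\}); (\mu X_2.{!\mathsf{int}}; X_2; X_2)$$
$$T \triangleq (\mu Y_1.\&\{n : Y_1; Y_1,\ \ell : \mathsf{skip}\}; {?\mathsf{int}}); (\mu Y_2.{!\mathsf{int}}; Y_2)$$


Starting from the empty set of productions, running grm consecutively on $S$ and
on $T$ produces the following set of productions


$$\begin{array}{llll}
X_1 \to \&n\, X_1 X_1 X_3 & X_3 \to {?\mathsf{int}} & Y_1 \to \&n\, Y_1 Y_1 Y_3 & Y_2 \to {!\mathsf{int}}\, Y_2 \\
X_1 \to \&\ell\, X_4 & X_4 \to {?\mathsf{int}} & Y_1 \to \&\ell\, Y_3 & Y_3 \to {?\mathsf{int}} \\
X_2 \to {!\mathsf{int}}\, X_2 X_2
\end{array}$$


and two start words $X_1 X_2$ and $Y_1 Y_2$.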


Pruning unnormed productions. For $\vec{a}$ a non-empty sequence of terminal
symbols $a_1, \ldots, a_n$, write $\vec{Y} \xrightarrow{\vec{a}}_P \vec{Z}$ when
$\vec{Y} \xrightarrow{a_1}_P \cdots \xrightarrow{a_n}_P \vec{Z}$. We say that $\vec{Y}$ is
normed when $\vec{Y} \xrightarrow{\vec{a}}_P \varepsilon$ for some $\vec{a}$, and that $\vec{Y}$ is
unnormed otherwise. When $\vec{Y}$ is normed, the minimal path of $\vec{Y}$ is the shortest
$\vec{a}$ such that $\vec{Y} \xrightarrow{\vec{a}}_P \varepsilon$. In this case, the norm of
$\vec{Y}$, denoted by $|\vec{Y}|$, is the length of $\vec{a}$. As observed by
Christensen, Hüttel, and Stirling [6], any unnormed word $\vec{Y}$ is bisimilar to its
concatenation with any other word, that is, if $\vec{Y}$ is unnormed, then
$\vec{Y} \sim_P \vec{Y}\vec{X}$. We use this fact to prune unreachable symbols in
unnormed words, and we do this in all productions.
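
The pruning stage can be sketched in Python as a least fixed-point computation of
norms followed by truncation of right-hand sides. This is an illustration under
assumptions, not the paper's code: productions are a dictionary from
(non-terminal, terminal) to a tuple of non-terminals, and the names `norms`,
`prune_word` and `prune` are hypothetical.

```python
def norms(productions):
    """Map each normed non-terminal to its norm; unnormed ones are absent."""
    norm = {}
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for (y, _a), rhs in productions.items():
            if all(z in norm for z in rhs):
                candidate = 1 + sum(norm[z] for z in rhs)
                if candidate < norm.get(y, float("inf")):
                    norm[y] = candidate
                    changed = True
    return norm


def prune_word(word, norm):
    """Cut a word right after its first unnormed symbol."""
    for i, y in enumerate(word):
        if y not in norm:
            return word[: i + 1]
    return word


def prune(productions):
    """Remove unreachable symbols from the right-hand side of every production."""
    norm = norms(productions)
    return {key: prune_word(rhs, norm) for key, rhs in productions.items()}
```

On the productions obtained for type S in Example 1, this sketch finds X3 and X4
with norm 1, X1 with norm 2 and X2 unnormed, so only the production for X2 is
shortened.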


Example 2. Recall Example 1 and notice that $X_2$ and $Y_2$ are both unnormed.
Then, the last occurrence of $X_2$ in production $X_2 \to {!\mathsf{int}}\, X_2 X_2$ is unreachable,
hence we simplify the production to obtain $X_2 \to {!\mathsf{int}}\, X_2$.


Building an expansion tree. We base the third stage of the algorithm on the
notion of expansion tree as proposed by Jančar and Moller [18], adapting an
idea by Hirshfeld [11]. The nodes in trees are labelled by sets of pairs of words.
We say that a node $N'$ is an expansion of $N$ if $N'$ is a minimal set such that,
for every pair $(\vec{X}, \vec{Y}) \in N$,


— if $\vec{X} \xrightarrow{a}_P \vec{X}'$ then $\vec{Y} \xrightarrow{a}_P \vec{Y}'$ with $(\vec{X}', \vec{Y}') \in N'$, and
— if $\vec{Y} \xrightarrow{a}_P \vec{Y}'$ then $\vec{X} \xrightarrow{a}_P \vec{X}'$ with $(\vec{X}', \vec{Y}') \in N'$.


An expansion tree is built from a root node: the singleton set containing
the pair of start words obtained by translating the two types into a grammar.
A child node is obtained from its parent node by expansion. However, as
Jančar and Moller observed, expansions alone often lead to infinite trees. We then
alternate between expansion and simplification operations, until either finding
an empty node, in which case we decide equivalence positively, or failing to
expand all nodes, in which case we decide equivalence negatively. We say that
a branch is successful if it is infinite or finishes in an empty node; otherwise it
is said to be unsuccessful.

In the expansion step, each node $N$ derives a single child node, obtained as
an expansion of $N$. As we are dealing with simple grammars, no branching is
expected in the expansion tree at this step.
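
The expansion step for simple grammars can be sketched in Python as follows. The
sketch assumes the same hypothetical encoding as before (words as tuples,
productions as a dictionary keyed by non-terminal and terminal); `expand_node`
returns `None` when some pair has no expansion, which is a convention of this
sketch.

```python
def transitions(word, productions):
    """Return {a: word'} for every transition word --a-->_P word'."""
    if not word:
        return {}
    head, tail = word[0], word[1:]
    return {a: rhs + tail for (y, a), rhs in productions.items() if y == head}


def expand_node(node, productions):
    """Return the expansion of `node` (a set of pairs of words), or None.

    For a simple grammar each action has at most one successor per word, so
    the expansion is unique whenever it exists.
    """
    child = set()
    for left, right in node:
        l_moves = transitions(left, productions)
        r_moves = transitions(right, productions)
        if l_moves.keys() != r_moves.keys():
            return None              # one side offers an action the other lacks
        child.update((l_moves[a], r_moves[a]) for a in l_moves)
    return frozenset(child)
```

With the pruned productions of Examples 1 and 2, expanding the root node
{(X1X2, Y1Y2)} yields the pairs (X1X1X3X2, Y1Y1Y3Y2) and (X4X2, Y3Y2).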

The simplification step consists in the application of the following rules:


Reflexive rule: Omit from a node any pair of the form $(\vec{X}, \vec{X})$;

Congruence rule: Omit from a node $N$ any pair that belongs to the least
congruence containing the ancestors of $N$;

BPA1 rule: If $(X_0\vec{X}, Y_0\vec{Y})$ is in $N$ and $(X_0\vec{X}', Y_0\vec{Y}')$ belongs to the
ancestors of $N$, then create a sibling node for $N$ replacing $(X_0\vec{X}, Y_0\vec{Y})$ by
$(\vec{X}, \vec{X}')$ and $(\vec{Y}, \vec{Y}')$;

BPA2 rule: If $(X_0\vec{X}, Y_0\vec{Y})$ is in $N$ and $X_0$ and $Y_0$ are normed, then:
Case $|X_0| \leq |Y_0|$: Let $\vec{a}$ be a minimal path for $X_0$ and $\vec{Z}$ the word such
that $Y_0 \xrightarrow{\vec{a}}_P \vec{Z}$. Add a sibling node for $N$ including the pairs
$(X_0\vec{Z}, Y_0)$ and $(\vec{X}, \vec{Z}\vec{Y})$ in place of $(X_0\vec{X}, Y_0\vec{Y})$;

Otherwise: Let $\vec{a}$ be a minimal path for $Y_0$ and $\vec{Z}$ the word such that
$X_0 \xrightarrow{\vec{a}}_P \vec{Z}$. Add a sibling node for $N$ including the pairs
$(X_0, Y_0\vec{Z})$ and $(\vec{Z}\vec{X}, \vec{Y})$ in place of $(X_0\vec{X}, Y_0\vec{Y})$.


Contrary to expansion and to the reflexive and congruence simplifications,
BPA rules promote branching in the expansion tree. We iteratively apply the
simplification rules to ensure the algorithm computes the simplest possible
children nodes derived from $N$. We can easily show that the simplification function
that results from applying the reflexive, congruence, and BPA rules has a fixed
point in the complete partially ordered set of pairs node-ancestors, where the set
of ancestors is fixed. The proof builds a partial order on the sets of pairs
node-ancestors and uses Tarski's fixed point theorem [25]. The number of children
nodes generated by the application of these rules is finite [6,18]. Notice that the
sibling nodes do not exclude the (often) infinite branch resulting from successive
expansions.


Checking the bisimilarity of simple grammars. Given a set of productions and
two start words $\vec{X}$ and $\vec{Y}$ (all pruned), function bisimG alternates between
simplification and expansion stages, starting with expansion. To avoid getting stuck
in an infinite branch of the expansion tree, we use a breadth-first search on the
expansion tree: node-ancestor pairs to be processed are stored in a queue. The
initial pair inserted in the queue contains the initial node $\{(\vec{X}, \vec{Y})\}$ and an
empty set of ancestors.


bisimG(X, Y, P) = expand(singletonQueue(({(X, Y)}, ∅)), P)


Predicate expand terminates as soon as all nodes fail to expand (signalled by
an empty queue), in which case the algorithm returns False, or an empty node is
reached, in which case the algorithm returns True. Otherwise, it extracts node
n at the front of the queue, simplifies its child node, and recurs.


expand(q, P) =
  if empty(q) then False
  else (n, a) = front(q)
       if empty(n) then True
       else if hasChild(n, P)
            then expand(simplify({(child(n, P), a ∪ n)}, dequeue(q), P), P)
            else expand(dequeue(q), P)

The simplification stage distinguishes the case where all type variables are 
normed, in which case BPA1 is not required to decide equivalence [5,6], from the 
case where some type variables might be unnormed. 

rules = if allProductionsNormed(P) then [reflex, congruence, bpa2]
        else [reflex, congruence, bpa1, bpa2]

Function simplify applies the various rules iteratively, until reaching a fixed 
point. The application of the rules (via function apply) produces a set of nodes 


that are then enqueued. The simplification stage does not introduce new levels 
in the tree, hence the set of ancestors na is passed to function apply as is. 


simplify(na, q, P) = fold (enqueue, q, apply(na, rules, P)) 


Example 3. The expansion tree for our running example is in Figure 1. Once a
successful branch is reached (marked with ✓), bisimG(X, Y, P) returns True.


Checking the bisimilarity of context-free session types. Function bisimT decides
the equivalence of two well-formed and renamed types, $S$ and $T$. It starts by
computing the start words for $S$ and $T$ by first translating $S$ to a grammar and
enriching this with the productions for type $T$. After pruning the productions
in the grammar (function prune), the equivalence of $S$ and $T$ is decided using
function bisimG.


bisimT(S, T) = bisimG(X, Y, prune(P))
  where (X, P') = grm(S, ∅)
        (Y, P) = grm(T, P')
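
The breadth-first skeleton of the search can be sketched in Python. This is a
deliberately reduced sketch and not the algorithm of the paper: it implements
expansion, the reflexive rule, and only the weakest instance of the congruence rule
(dropping pairs that already occur among the ancestors). The BPA rules are omitted,
so the sketch may fail to decide within its step bound, in which case it returns
`None`. All names and the `max_steps` parameter are assumptions of the sketch.

```python
from collections import deque


def transitions(word, productions):
    """Return {a: word'} for every transition word --a-->_P word'."""
    if not word:
        return {}
    head, tail = word[0], word[1:]
    return {a: rhs + tail for (y, a), rhs in productions.items() if y == head}


def expand_node(node, productions):
    """Expansion of a node for a simple grammar, or None if a pair mismatches."""
    child = set()
    for left, right in node:
        l_moves = transitions(left, productions)
        r_moves = transitions(right, productions)
        if l_moves.keys() != r_moves.keys():
            return None
        child.update((l_moves[a], r_moves[a]) for a in l_moves)
    return frozenset(child)


def bisim_sketch(x, y, productions, max_steps=1000):
    """Return True or False when decided, None when the step bound is reached."""
    queue = deque([(frozenset({(x, y)}), frozenset())])
    for _ in range(max_steps):
        if not queue:
            return False                 # every branch failed to expand
        node, ancestors = queue.popleft()
        if not node:
            return True                  # empty node: successful branch
        child = expand_node(node, productions)
        if child is None:
            continue                     # this branch is unsuccessful
        seen = ancestors | node
        # Reflexive rule, plus omission of pairs already among the ancestors.
        simplified = frozenset(p for p in child if p[0] != p[1] and p not in seen)
        queue.append((simplified, seen))
    return None
```

Because no BPA rule is applied, no sibling nodes are created and the queue never
holds more than one entry; the queue is kept only to mirror the structure of
function expand.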


4 Correctness of the algorithm 


In this section we prove that function bisimT is sound and complete with respect
to the meta-theory of context-free session types. We start by showing a full
abstraction result between context-free session types and grammars in Greibach
Normal Form. Then, based on results from Caucal [5], Christensen, Hüttel, and
Stirling [6], Jančar and Moller [18], we conclude that the algorithm we propose
is sound and complete.
Fig. 1: An example of an expansion tree


Type translation is fully abstract. Sections 2 and 3 introduce bisimulation
relations $\sim_{\mathcal{T}}$ on the set $\mathcal{T}$ of types and $\sim_P$ on a given set $P$ of
productions. Our ultimate goal is to prove that we can faithfully analyze the
bisimilarity of types by analyzing the bisimilarity of the corresponding grammars.
For this purpose, we prove that the translation proposed in Section 3 is a fully
abstract encoding, i.e., preserves the bisimilarity relation.

We start by showing that the transformation of types to grammars preserves
the labelled transitions. The following result states that grammars produced by
grm mimic the transitions of the corresponding types and vice versa.

Lemma 2. Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$. Then,
$S \xrightarrow{a}_{\mathcal{T}} T$ if and only if $\vec{X} \xrightarrow{a}_P \vec{Y}$.


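Proof. For the direct implication we proceed by rule induction on the hypothesis,
using the definition of word.


— Case $\sharp B$: if $\sharp B \xrightarrow{\sharp B}_{\mathcal{T}} \mathsf{skip}$, then $\mathsf{word}(\sharp B) \xrightarrow{\sharp B}_P \varepsilon$.

— Case $\star\ell_i$: if $S = \star\{\ell_i : S_i\}_{i \in I} \xrightarrow{\star\ell_i}_{\mathcal{T}} S_i$, then
$\mathsf{word}(S) \xrightarrow{\star\ell_i}_P \mathsf{word}(S_i)$.

— Case $S_1;S_2$ with $S_1 \xrightarrow{a}_{\mathcal{T}} S_1'$: if $S_1;S_2 \xrightarrow{a}_{\mathcal{T}} S_1';S_2$ and
$S_1 \xrightarrow{a}_{\mathcal{T}} S_1'$, by induction hypothesis, we have
$\mathsf{word}(S_1) \xrightarrow{a}_P \mathsf{word}(S_1')$. Furthermore,
$\mathsf{word}(S_1;S_2) = \mathsf{word}(S_1) \cdot \mathsf{word}(S_2)$. Hence,
$\mathsf{word}(S_1;S_2) \xrightarrow{a}_P \mathsf{word}(S_1';S_2)$.

— Case $S_1;S_2$ with $S_1\,\checkmark$ and $S_2 \xrightarrow{a}_{\mathcal{T}} S_2'$: in the case
$S_1;S_2 \xrightarrow{a}_{\mathcal{T}} S_2'$, where $S_1\,\checkmark$ and $S_2 \xrightarrow{a}_{\mathcal{T}} S_2'$, by
Lemma 1 and since $S_1\,\checkmark$, we have $\mathsf{word}(S_1;S_2) = \mathsf{word}(S_2)$. Thus, by
induction hypothesis we have $\mathsf{word}(S_1;S_2) \xrightarrow{a}_P \mathsf{word}(S_2')$.

— Case $\mu X.T$: if $S = \mu X.T \xrightarrow{a}_{\mathcal{T}} S'$, then $[\mu X.T/X]T \xrightarrow{a}_{\mathcal{T}} S'$. Also
$\mathsf{unr}(S) \xrightarrow{a}_{\mathcal{T}} S'$ and, by induction hypothesis,
$\mathsf{word}(\mathsf{unr}(S)) \xrightarrow{a}_P \mathsf{word}(S')$. Hence, by definition of word,
$\mathsf{word}(S) = X \xrightarrow{a}_P \mathsf{word}(S')$.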

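For the reverse implication, we prove that any transition in the grammar leads
to a transition in the corresponding types.

— if $\mathsf{word}(S) \xrightarrow{\sharp B}_P \vec{X}$, then $\mathsf{word}(S) = Y \cdot \vec{X}$, where
$Y \xrightarrow{\sharp B}_P \varepsilon$, and so $\mathsf{unr}(S) = \sharp B;T$ and thus $S \xrightarrow{\sharp B}_{\mathcal{T}} T$.

— if $\mathsf{word}(S) \xrightarrow{\star\ell}_P \vec{X}$, then the first symbol $Y$ of $\mathsf{word}(\mathsf{unr}(S))$ has a
production labelled $\star\ell$. Hence, $\mathsf{unr}(S)$ is of the form $\star\{\ell_j : U_j\}_{j \in J};T$ with
$\ell_j = \ell$ and $\vec{X} = \mathsf{word}(U_j;T)$, for some $j \in J$. Using the LTS we conclude
that $S \xrightarrow{\star\ell}_{\mathcal{T}} U_j;T$.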


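Lemma 3. If $\mathsf{word}(S) \xrightarrow{a}_P \vec{X}$, then there exists $T$ such that
$S \xrightarrow{a}_{\mathcal{T}} T$ and $\vec{X} = \mathsf{word}(T)$.


Proof. By induction on the definition of word.


The main result of this subsection follows from Lemmas 2 and 3.


Theorem 1. Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$. Then, grm is
a fully abstract encoding, i.e., $S \sim_{\mathcal{T}} T$ if and only if $\vec{X} \sim_P \vec{Y}$.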


Proof. For the direct implication, assume that $S \sim_{\mathcal{T}} T$ and let $B$ be a
bisimulation for $S$ and $T$. Then, consider
$B' = \{(\mathsf{word}(S_0), \mathsf{word}(T_0)) \mid (S_0, T_0) \in B\}$. Obviously,
$(\mathsf{word}(S), \mathsf{word}(T)) \in B'$. To prove that $B'$ is a bisimulation, one assumes
that $\mathsf{word}(S_0) \xrightarrow{a}_P \vec{X}$ and proves that there exists $\vec{Y}$ such that
$\mathsf{word}(T_0) \xrightarrow{a}_P \vec{Y}$ with $(\vec{X}, \vec{Y}) \in B'$. This proof is done by
coinduction on the definition of word, and uses Lemmas 2 and 3 and the definition of $B'$.

For the reverse implication, assume that $\vec{X} \sim_P \vec{Y}$, with $\vec{X} = \mathsf{word}(S)$ and
$\vec{Y} = \mathsf{word}(T)$, and let $B'$ be a bisimulation for $\vec{X}$ and $\vec{Y}$. Then, consider
$B = \{(S_0, T_0) \mid (\mathsf{word}(S_0), \mathsf{word}(T_0)) \in B'\}$. Notice that $(S, T) \in B$. The proof
that $B$ is a bisimulation consists in showing that, given $(S_0, T_0) \in B$ such that
$S_0 \xrightarrow{a}_{\mathcal{T}} S_0'$, there exists $T_0'$ such that $T_0 \xrightarrow{a}_{\mathcal{T}} T_0'$ and
$(S_0', T_0') \in B$. The proof follows by rule coinduction on the LTS and uses Lemmas 2 and 3.


Now we sketch the proof that pruning grammars also preserves bisimulation.
We distinguish the grammars in the context through the subscript of $\sim$.


Theorem 2. $\vec{X} \sim_P \vec{Y}$ if and only if $\vec{X} \sim_{\mathsf{prune}(P)} \vec{Y}$.
Proof. For the direct implication, the bisimulation for $\vec{X}$ and $\vec{Y}$ over $P$ is also a
bisimulation for $\vec{X}$ and $\vec{Y}$ over $\mathsf{prune}(P)$. For the reverse implication, if $B'$ is a
bisimulation for $\vec{X}$ and $\vec{Y}$ over $\mathsf{prune}(P)$, then
$B = B' \cup \{(\vec{V}W, \vec{V}W\vec{Z}) \mid (W \to \vec{V}W\vec{Z}) \in P,\ W \text{ unnormed}\}$
is a bisimulation for $\vec{X}$ and $\vec{Y}$ over $P$.
Correctness of the algorithm. We now focus on the correctness of the function
bisimG. Before proceeding to soundness, we recall the safeness property
introduced by Jančar and Moller [18].


Lemma 4 (Safeness Property). Given a set of productions $P$, $\vec{X} \sim_P \vec{Y}$ if
and only if the expansion tree rooted at $\{(\vec{X}, \vec{Y})\}$ has a successful branch.


Notice that function bisimG builds an expansion tree by alternating between
simplification (reflexive, congruence, and BPA) and expansion operations, as
proposed by Jančar and Moller. These simplification rules are safe [18], in the
sense that the application of any rule preserves the bisimulation from a parent
node to at least one child node and, reciprocally, that bisimulation on a child
node implies the bisimulation of its parent node.

While the safeness property is instrumental in proving soundness, the finite
witness property is of utmost importance to prove completeness. This result
follows immediately from the analysis by Jančar and Moller [18], which capitalizes
on results by Caucal [5], and Christensen, Hüttel, and Stirling [6]:


Lemma 5 (Finite Witness Property). Given a set of productions $P$, if
$\vec{X} \sim_P \vec{Y}$ then the expansion tree rooted at $\{(\vec{X}, \vec{Y})\}$ has a finite
successful branch.


We refer to Caucal, Christensen, Hüttel, and Stirling for details on the proof
of existence of a finite witness, as stated in Lemma 5. This proof is particularly
interesting in that it highlights the importance of the BPA rules and of pruning
productions in reaching such a (finite) witness. The results in these two papers also
elucidate the reason for the distinction, in the simplification phase, between the
cases where all the symbols in the grammar are and are not normed (cf. program
variable rules in function expand). The safeness and finite witness properties
ensure the termination of the algorithm, its soundness and completeness.


Lemma 6 (Termination). Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$.
Then, the computation of $\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ always terminates.


Proof. Start by noticing that $\mathsf{prune}(P)$ always terminates. For bisimG itself, if
$S \sim_{\mathcal{T}} T$ then, by Theorems 1 and 2, we have
$\mathsf{word}(S) \sim_{\mathsf{prune}(P)} \mathsf{word}(T)$ and thus the existence of a finite successful
branch is ensured by the finite witness property (Lemma 5). Hence, breadth-first search
eventually terminates.

When $S \not\sim_{\mathcal{T}} T$, we easily conclude that all branches in the expansion tree
are finite and thus $\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ terminates. To conclude that all
branches are finite, observe that any infinite branch is successful by definition and thus
the safeness property would imply $\mathsf{word}(S) \sim_{\mathsf{prune}(P)} \mathsf{word}(T)$ and we would
have $S \sim_{\mathcal{T}} T$, by Theorems 1 and 2.

Lemma 7. Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$. If
$\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ returns True, then $\vec{X} \sim_{\mathsf{prune}(P)} \vec{Y}$.


Proof. Function bisimG returns True whenever it reaches a (finite) successful 
branch in the expansion tree rooted at {(X,Y)}, i.e., a branch terminating in 
an empty node. Conclude with the safeness property, Lemma 4. 
From the previous results, the soundness of our algorithm is now immediate:
the algorithm to check the bisimulation of context-free session types is sound
with respect to the meta-theory of context-free session types.


Theorem 3 (Soundness). Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$.
If $\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ returns True then $S \sim_{\mathcal{T}} T$.


Proof. From Theorem 1, Theorem 2, and Lemma 7. 


Given that the algorithm terminates (Lemma 6), we know that if $S \not\sim_{\mathcal{T}} T$,
then $\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ returns False, where
$(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and $(\vec{Y}, P) = \mathsf{grm}(T, P')$. We now show that the
algorithm to check the bisimulation of context-free session types is complete with respect
to the meta-theory of context-free session types. The finite witness property is paramount
to achieve this result.


Theorem 4 (Completeness). Let $(\vec{X}, P') = \mathsf{grm}(S, \emptyset)$ and
$(\vec{Y}, P) = \mathsf{grm}(T, P')$. If $S \sim_{\mathcal{T}} T$ then
$\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ returns True.


Proof. Assume $S \sim_{\mathcal{T}} T$. By Theorems 1 and 2, we have
$\vec{X} \sim_{\mathsf{prune}(P)} \vec{Y}$. Hence, Lemma 5 ensures the existence of a finite successful
branch on the expansion tree rooted at $\{(\vec{X}, \vec{Y})\}$, i.e., a branch terminating in an
empty node. Since our algorithm traverses the expansion tree using breadth-first search,
it will eventually reach the empty node and conclude the bisimulation positively.


Theorem 4 ensures that if $\mathsf{bisimG}(\vec{X}, \vec{Y}, \mathsf{prune}(P))$ returns False then
$S \not\sim_{\mathcal{T}} T$.


5 Evaluation 


This section discusses the behaviour of our algorithm in the real world. Both for 
testing and for performance evaluation, we require test suites. We started with 
a carefully crafted, manually produced, suite of valid and invalid tests. This test 
suite was assembled by gathering pairs of types that emerged from examples 
we have studied and from programs we have written in FreeST, a programming 
language with context-free session types [2]. The tests produced by this method 
are, on the one hand, small, and, on the other hand, lacking diversity. 

We then turned our attention to the automatic generation of test cases.
Producing pairs of arbitrary (well-formed) types that share no variables is simple.
However, the probability that a randomly generated pair of types turns out to be
bisimilar is extremely low. For this reason, we generate arbitrary pairs of types
that are bisimilar by construction. Theorem 5 naturally induces an algorithm:
given a natural number $n$ (the size of the pair), arbitrarily select for the base
case ($n = 0$) one of the pairs in item 1 of the theorem and for the recursive case
($n \geq 1$) one of the pairs in items 2 to 12.
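Theorem 5 (Properties of type bisimilarity).


1. $\mathsf{skip} \sim_{\mathcal{T}} \mathsf{skip}$ and $\sharp B \sim_{\mathcal{T}} \sharp B$;

2. $S;T \sim_{\mathcal{T}} U;V$ if $S \sim_{\mathcal{T}} U$ and $T \sim_{\mathcal{T}} V$;

3. $\mu X.S \sim_{\mathcal{T}} \mu X.T$ if $S \sim_{\mathcal{T}} T$;

4. $\star\{\ell_i : S_i\}_{i \in I} \sim_{\mathcal{T}} \star\{\ell_i : T_i\}_{i \in I}$ if $(S_i \sim_{\mathcal{T}} T_i)_{i \in I}$;

5. $S \sim_{\mathcal{T}} T;\mathsf{skip}$ and $S \sim_{\mathcal{T}} \mathsf{skip};T$ if $S \sim_{\mathcal{T}} T$;

6. $\star\{\ell_i : S_i\}_{i \in I};U \sim_{\mathcal{T}} \star\{\ell_i : T_i;V\}_{i \in I}$ if $(S_i \sim_{\mathcal{T}} T_i)_{i \in I}$ and $U \sim_{\mathcal{T}} V$;

7. $T \sim_{\mathcal{T}} S$ if $S \sim_{\mathcal{T}} T$;

8. $R;(S;T) \sim_{\mathcal{T}} (U;V);W$ if $R \sim_{\mathcal{T}} U$, $S \sim_{\mathcal{T}} V$, and $T \sim_{\mathcal{T}} W$;

9. $\mu X.\mu Y.S \sim_{\mathcal{T}} \mu X.[X/Y]T \sim_{\mathcal{T}} \mu Y.[Y/X]T$ if $S \sim_{\mathcal{T}} T$;

10. $\mu X.S \sim_{\mathcal{T}} T$ if $S \sim_{\mathcal{T}} T$ and $X \notin \mathsf{free}(S)$;

11. $[U/X]S \sim_{\mathcal{T}} [V/X]T$ if $S \sim_{\mathcal{T}} T$ and $U \sim_{\mathcal{T}} V$;

12. $\mu X.S \sim_{\mathcal{T}} [\mu X.T/X]T$ if $S \sim_{\mathcal{T}} T$.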


Proof. 1-3: Bisimulation is a congruence. 4-12: Thiemann and Vasconcelos [26] 
exhibit the appropriate bisimulations. 
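
The generation scheme induced by Theorem 5 can be illustrated with a small Python
sketch. It is not the QuickCheck generator used in the evaluation: it covers only
items 1, 2, 5 and 7 (so it produces types without choices or recursion), uses an
invented tuple encoding of types, and the base types `Int` and `Bool` are
placeholders. For this restricted fragment, two types are bisimilar exactly when
they send and receive the same sequence of messages, which is what the helper
`actions` computes.

```python
import random

SKIP = ("skip",)


def bisimilar_pair(n, rng):
    """Build a pair of types of size n that is bisimilar by construction."""
    if n == 0:                                     # item 1
        if rng.random() < 0.5:
            return SKIP, SKIP
        msg = ("msg", rng.choice("!?"), rng.choice(["Int", "Bool"]))
        return msg, msg
    rule = rng.choice(["seq", "skip-right", "skip-left", "sym"])
    if rule == "seq":                              # item 2
        k = rng.randrange(n)
        s, u = bisimilar_pair(k, rng)
        t, v = bisimilar_pair(n - 1 - k, rng)
        return ("seq", s, t), ("seq", u, v)
    s, t = bisimilar_pair(n - 1, rng)
    if rule == "skip-right":                       # item 5: S ~ T;skip
        return s, ("seq", t, SKIP)
    if rule == "skip-left":                        # item 5: S ~ skip;T
        return s, ("seq", SKIP, t)
    return t, s                                    # item 7: symmetry


def actions(t):
    """Flatten a choice-free, recursion-free type into its list of messages."""
    if t[0] == "skip":
        return []
    if t[0] == "msg":
        return [t[1] + t[2]]
    return actions(t[1]) + actions(t[2])           # sequential composition
```

Passing an explicit `random.Random` instance keeps the generated test suite
reproducible from a seed.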


For evaluating the algorithm on non-bisimilar pairs we add the following
five anti-axioms to the list in Theorem 5: (1) $\mathsf{skip} \not\sim_{\mathcal{T}} \sharp B$;
(2) $?B \not\sim_{\mathcal{T}} {!B}$; (3) $\mathsf{skip} \not\sim_{\mathcal{T}} \star\{\ell_i : S_i\}_{i \in I}$;
(4) $\oplus\{\ell_i : S_i\}_{i \in I} \not\sim_{\mathcal{T}} \&\{\ell_i : S_i\}_{i \in I}$;
(5) $\star\{\ell_i : S_i\}_{i \in I} \not\sim_{\mathcal{T}} \star\{\ell_j : S_j\}_{j \in J}$ where $I \subset J$.
We generate two types using the same methodology as
for the positive case and, then, discard the data collected when the pair turns
out to be bisimilar. This produces pairs of types that are much closer than those
obtained by random generation, thus hopefully approaching the reality that
compilers face when in production.

We used QuickCheck [7] to generate two test suites. That for bisimilar pairs 
is constructed based on Theorem 5, whereas the construction of non-bisimilar 
tests relies on Theorem 5 plus the anti-axioms above. Both test suites comprise 
2000 entries, featuring types with a number of nodes (in the syntax tree) ranging 
from 1 to 200. 

The base algorithm described in the previous section turns out to behave 
quite poorly. We then implemented the following variants. 


1. Eliminating redundant productions in the grammar. Since the size of the 
expansion tree depends, among other things, on the number of productions in 
the grammar, generating smaller grammars seems a promising optimisation. 
Rather than blindly adding a new production Y → Z to the grammar (in
function word, Section 3), we look, in the set of productions, for a production
W → X syntactically equal to the former, up to renaming of non-terminal
symbols. In this case, we add no new production and return non-terminal W
instead. To find W, we look for the least fixed-point of the transitions in the
languages generated by Z and X and compare them. This optimisation does
not compromise soundness, completeness, or termination.

2. Using a filter rule that removes nodes with hopeless pairs. A filter rule ensures 
that nodes composed by pairs of types with different norms (if normed) are 
removed from the expansion tree, since these types are not bisimilar. The 
filter rule preserves the results of soundness, completeness, and termination. 
(c) Runtime of B1 per number of nodes

Fig. 2: Results for the test suite of bisimilar pairs of types are shown in blue
and those for the test suite of non-bisimilar pairs in orange. Time is in
milliseconds. The scales of 2a and 2c are logarithmic; the scale of 2b is linear.


3. Using a double-ended queue to prepend promising children. A double-ended
queue allows prioritizing nodes with potential to reach an empty node faster.
The algorithm prepends (rather than appends) empty nodes or nodes whose
pairs (X,Y) are such that |X| ≤ 1 and |Y| ≤ 1. This procedure does not
compromise soundness, completeness, or termination because the number of
terminal symbols is finite and the algorithm takes advantage of the reflexive
and congruence rules to remove previously visited nodes from the queue.


To better understand how the algorithm performs in practice, we tested all
the optimisations and their combinations. We evaluate each variant 1-3
individually (denoted by B1-B3) and all their combinations. For instance, B12 denotes
the variant obtained from combining optimisations 1 and 2 above. B stands for
the base algorithm, bisimT. We implemented the base algorithm and its variants
in Haskell, using the Glasgow Haskell Compiler (version 8.6.5). The evaluation
was conducted on a machine with an Intel Core i7-6700K at 4.2GHz and 8 GB
of RAM running Arch Linux; tests were run under a timeout of 2 minutes.
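
The prepending strategy of variant 3 can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Haskell code of the implementation; the node representation (a set of pairs of tuples of non-terminal names), the function names, and the length bound of 1 are assumptions made for the example.

```python
from collections import deque

def is_promising(node, bound=1):
    """A node is a set of pairs (xs, ys) of tuples of non-terminals.

    Assumption: a node is promising when it is empty or when both
    components of every pair have length at most `bound`.
    """
    return all(len(xs) <= bound and len(ys) <= bound for xs, ys in node)

def enqueue(queue, child):
    """Prepend promising children, append all others."""
    if is_promising(child):
        queue.appendleft(child)
    else:
        queue.append(child)

queue = deque()
enqueue(queue, frozenset({(("X", "Y"), ("Z",))}))  # long word: appended
enqueue(queue, frozenset({(("X",), ("Z",))}))      # short words: prepended
enqueue(queue, frozenset())                        # empty node: prepended
```

After these three calls the empty node sits at the front of the queue, so a traversal that pops from the left reaches it first.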
Figure 2a depicts the distribution of the execution times (in ms) for both
test suites and all variants. We observe that the behavior of the negative tests is
roughly the same in all variants. However, the execution times for the positive
tests differ from variant to variant. These differences mainly depend on the
trade-off between the computational effort required for each optimisation and
the efficiency it brings to deciding the equivalence of grammars. We observe
that including optimisation 1 improves the execution time, while the others, in
general, do not. The combination of optimisations has a positive impact on
execution time, with the exception of the B23 variant, whose distribution is
worse than the base case.

Figure 2b shows the number of timeouts for each variant. The base case, B, 
has 146 positive tests whose execution time exceeds 2 minutes. The distribution 
of timeouts per variant exhibits a behavior that is consistent with that of runtime 
shown in Figure 2a. All combinations lead to a reduction in the number of 
timeouts, when compared to the base case. 

Variant B1, resulting from considering optimisation 1, performs better than
all others, presenting a median of 1.4 milliseconds and 7 timeouts, both for the
positive tests. By taking advantage of optimisation 1, the number of timeouts
is reduced by 95% (from 146 to 7). The remaining positive tests take, on
average, 1863.38 ms to complete with the base algorithm and 195.68 ms with
variant B1, resulting in an 89% reduction in the execution time. This is the
variant in production for the FreeST compiler [2].
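
The two percentages quoted above follow directly from the reported figures. The short check below recomputes them; the helper name `reduction_percent` is ours and is only there for illustration.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Timeouts on positive tests: base algorithm B versus variant B1.
timeout_reduction = reduction_percent(146, 7)

# Average runtime (ms) of the remaining positive tests: B versus B1.
runtime_reduction = reduction_percent(1863.38, 195.68)

print(f"timeouts: {timeout_reduction:.1f}%  runtime: {runtime_reduction:.1f}%")
```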

The distribution of the execution time of B1 against the size of the input types
is depicted in Figure 2c. As expected, the execution time increases considerably 
with the number of nodes. Although we have carried out tests with a fairly large 
number of nodes in the abstract syntax trees, we remark that, when used in a 
compiler, the algorithm will mostly come across types with a reduced number of 
nodes. 


6 Conclusion 


Context-free session types are a promising tool to describe protocols in con- 
current programs. In order to be incorporated in programming languages and 
effectively used in compilers, a practical algorithm to decide bisimulation is called 
for. Taking advantage of a process algebra graph representation of types to de- 
cide bisimulation [12,13], we developed one such algorithm and proved it correct. 
The algorithm is incorporated in a compiler for a concurrent functional language 
equipped with context-free session types [2]. 

Possible extensions to this work include addressing higher-order session types. 
We also plan to extend the implementation of the algorithm to cope with context- 
free grammars in Greibach Normal Form that are not necessarily deterministic. 
Abstract. We showed in a recent paper that, when verifying a modal
μ-calculus formula, the actions of the system under verification can be
partitioned into sets of so-called weak and strong actions, depending on 
the combination of weak and strong modalities occurring in the formula. 
In a compositional verification setting, where the system consists of pro- 
cesses executing in parallel, this partition allows us to decide whether 
each individual process can be minimized for either divergence-preserving 
branching (if the process contains only weak actions) or strong (other- 
wise) bisimilarity, while preserving the truth value of the formula. In this 
paper, we refine this idea by devising a family of bisimilarity relations, 
named sharp bisimilarities, parameterized by the set of strong actions. 
We show that these relations have all the nice properties necessary to 
be used for compositional verification, in particular congruence and ad- 
equacy with the logic. We also illustrate their practical utility on several 
examples and case-studies, and report about our success in the RERS 
2019 model checking challenge. 


Keywords: Bisimulation - Concurrency - Model checking - Mu-calculus. 


1 Introduction 


This paper deals with the verification of action-based, branching-time temporal
properties expressible in the modal μ-calculus ($L_\mu$) [31] on concurrent systems
consisting of processes composed in parallel, usually described in languages with
process algebraic flavour. A well-known problem is the state-space explosion that
happens when the system state space exceeds the available computer memory.

Compositional verification is a set of techniques and tools that have proven 
efficient to palliate state-space explosion in many case studies [18]. They may 
either focus on the construction of the state space reduced for some equivalence 
relation, such as compositional state space construction [24, 32,36, 43, 45-47], 
or on the decomposition of the full system verification into the verification of 
(expectedly smaller) subsystems, such as compositional reachability analysis [49, 
10], assume-guarantee reasoning [41], or partial model checking [1, 34]. 
In this paper, we focus on property-dependent compositional state space
construction, where the reduction to be applied to the system is obtained by
analysing the property under verification. We will refine the approach of [37]
which, given a formula $\varphi$ of $L_\mu$ to be verified, shows how to extract from $\varphi$ a
maximal hiding set of actions and a reduction (minimization for either strong [40]
or divergence-preserving branching bisimilarity [20, 23], divbranching for short)
that preserves the truth value of $\varphi$. The reduction is chosen according to
whether $\varphi$ belongs to an $L_\mu$ fragment named $L_\mu^{dbr}$, which is adequate with
divbranching bisimilarity. This fragment consists of $L_\mu$ restricted to weak
modalities, which match actions preceded by (property-preserving) sequences of
hidden actions, as opposed to traditional strong modalities $\langle\alpha\rangle\varphi_0$ and $[\alpha]\varphi_0$, which
match only a single action satisfying $\alpha$. If $\varphi$ belongs to $L_\mu^{dbr}$, then the system can
be reduced for divbranching bisimilarity; otherwise, it can be reduced for strong
bisimilarity, the weakest congruence preserving full $L_\mu$. We call this approach
of [37] the mono-bisimulation approach.

We refine the mono-bisimulation approach in [35], by handling the case of
$L_\mu$ formulas containing both strong and weak modalities. To do so, fragments
named $L_\mu^{strong}(A_s)$ extend $L_\mu^{dbr}$ with strong modalities matching only the
actions belonging to a given set $A_s$ of strong actions. This induces a partition of
the parallel processes into those containing at least one strong action and those
not containing any, so that a formula $\varphi \in L_\mu^{strong}(A_s)$ is still preserved if the
processes containing strong actions are reduced for strong bisimilarity and the other
ones for divbranching bisimilarity. We call this refined approach the combined
bisimulations approach. Guidelines are also provided in [35] to extract a set of
strong actions from particular $L_\mu$ formulas encoding the operators of widely-used
temporal logics, such as CTL [11], ACTL [39], PDL [15], and PDL-Δ [44]. This
approach is implemented on top of the CADP verification toolbox [19], and
experiments show that it can improve the capabilities of compositional verification
on realistic case studies, possibly reducing state spaces by orders of magnitude.

In this paper, we extend these results as follows: (1) We refine the approach
by devising a family of new bisimilarity relations, called sharp bisimilarities,
parameterized by the set of strong actions $A_s$. They are hybrid between strong
and divbranching bisimilarities, where strong actions are handled as in strong
bisimilarity whereas weak actions are handled as in divbranching bisimilarity.
(2) We show that each fragment $L_\mu^{strong}(A_s)$ is adequate with the corresponding
sharp bisimilarity, namely, $L_\mu^{strong}(A_s)$ is precisely the set of properties that
are preserved by sharp bisimilarity (w.r.t. $A_s$) on all systems. (3) We show
that, similarly to strong and divbranching bisimilarities, every sharp bisimilarity
is a congruence for parallel composition, which enables it to be used soundly
in a compositional verification setting. (4) We define an efficient state space


Note on terminology: in [18,37], the name divergence-sensitive is used instead of
divergence-preserving branching bisimulation (or branching bisimulation with
explicit divergences) [20, 23]. This could lead to confusion with the relation
defined in [13], also called divergence-sensitive but slightly different from the
former relation. To be consistent in notation, we replace the abbreviation dsbr
used in earlier work by dbr.
reduction algorithm that preserves sharp bisimilarity and has the same
worst-case complexity as divbranching minimization. Although it is not a minimization
(i.e., sharp bisimilar states may remain distinguished in the reduced state space),
it coincides with divbranching minimization whenever the process it is applied
to does not contain strong actions, and with strong minimization in the worst
case. Therefore, applying this reduction compositionally always yields state space
reduction at least as good as [35], which itself is an improvement over [37]. (5)
Finally, we illustrate our approach on case studies and compare our new results
with those of [35, 37]. We also report on our recent success in the RERS 2019
challenge, which was obtained thanks to this new approach.

The paper is organized as follows: Sections 2 and 3 introduce the necessary
background about process descriptions and temporal logic. Section 4 defines
sharp bisimilarity, states its adequacy with $L_\mu^{strong}(A_s)$, and its congruence
property for parallel composition. Section 5 presents the reduction algorithm
and shows that it is correct and efficient. Section 6 illustrates our new approach
on the case studies. Section 7 discusses related work. Finally, Section 8 concludes
and discusses research directions for the future. The proofs of all theorems
presented in this paper and a detailed description of how we tackled the RERS 2019
challenge are available in a Zenodo archive.


2 Processes, Compositions, and Reductions 


We consider systems of processes whose behavioural semantics can be repre- 
sented using an LTS (Labelled Transition System). 


Definition 1 (LTS). Let $\mathcal{A}$ be an infinite set of actions including the invisible
action $\tau$ and visible actions $\mathcal{A} \setminus \{\tau\}$. An LTS $P$ is a tuple $(\Sigma, A, \rightarrow, p_{init})$,
where $\Sigma$ is a set of states, $A \subseteq \mathcal{A}$ is a set of actions, ${\rightarrow} \subseteq \Sigma \times A \times \Sigma$ is the
(labelled) transition relation, and $p_{init} \in \Sigma$ is the initial state. We may write
$\Sigma_P$, $A_P$, $\rightarrow_P$ for the sets of states, actions, and transitions of an LTS $P$, and
$init(P)$ for its initial state. We assume that $P$ is finite and write $|P|_{st}$ (resp.
$|P|_{tr}$) for the number of states (resp. transitions) of $P$. We write $p \xrightarrow{a} p'$ for
$(p, a, p') \in {\rightarrow}$ and $p \xrightarrow{A}$ for $(\exists p' \in \Sigma_P, a \in A)\ p \xrightarrow{a}_P p'$.
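
As an informal illustration of Definition 1, an LTS can be held in a small record. The sketch below is in Python; the class name, field names, and the string used for the invisible action are our own choices and do not come from any tool mentioned in the paper.

```python
from dataclasses import dataclass

TAU = "tau"  # assumed encoding of the invisible action

@dataclass(frozen=True)
class LTS:
    states: frozenset
    actions: frozenset
    transitions: frozenset  # set of (source, action, target) triples
    init: object

    def num_states(self):
        """Number of states of the LTS."""
        return len(self.states)

    def num_transitions(self):
        """Number of transitions of the LTS."""
        return len(self.transitions)

    def step(self, p, a):
        """All targets p' such that p -a-> p'."""
        return {q for (s, b, q) in self.transitions if s == p and b == a}

    def enabled(self, p, action_set):
        """True if p has an outgoing transition labelled in action_set."""
        return any(s == p and b in action_set for (s, b, _) in self.transitions)

example = LTS(
    states=frozenset({0, 1, 2}),
    actions=frozenset({TAU, "a"}),
    transitions=frozenset({(0, TAU, 1), (1, "a", 2)}),
    init=0,
)
```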


LTS can be composed in parallel and their actions may be abstracted away 
using the parallel composition and action mapping defined below, of which action 
hiding, cut (also known as restriction), and renaming are particular cases. 


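Definition 2 (Parallel composition of LTS). Let $P, Q$ be LTS and $A_{sync} \subseteq
\mathcal{A} \setminus \{\tau\}$. The parallel composition of $P$ and $Q$ with synchronization on $A_{sync}$,
written "$P\ |[A_{sync}]|\ Q$", is defined as $(\Sigma_P \times \Sigma_Q, A_P \cup A_Q, \rightarrow, (init(P), init(Q)))$,
where $(p,q) \xrightarrow{a} (p',q')$ if and only if (1) $p \xrightarrow{a}_P p'$, $q' = q$, and $a \notin A_{sync}$, or (2)
$p' = p$, $q \xrightarrow{a}_Q q'$, and $a \notin A_{sync}$, or (3) $p \xrightarrow{a}_P p'$, $q \xrightarrow{a}_Q q'$, and $a \in A_{sync}$.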
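
The three rules of Definition 2 translate almost literally into code. The following Python sketch builds the full product (not only its reachable part); the tuple layout `(states, actions, transitions, init)` and the function name are assumptions made for the example.

```python
from itertools import product

def parallel(P, Q, sync):
    """Parallel composition of two LTS given as
    (states, actions, transitions, init) tuples, synchronizing on `sync`."""
    (sp, ap, tp, ip), (sq, aq, tq, iq) = P, Q
    trans = set()
    # Rule (1): P moves alone on a non-synchronized action.
    for (p, a, p2) in tp:
        if a not in sync:
            trans |= {((p, q), a, (p2, q)) for q in sq}
    # Rule (2): Q moves alone on a non-synchronized action.
    for (q, a, q2) in tq:
        if a not in sync:
            trans |= {((p, q), a, (p, q2)) for p in sp}
    # Rule (3): both move together on a synchronized action.
    for (p, a, p2) in tp:
        for (q, b, q2) in tq:
            if a == b and a in sync:
                trans.add(((p, q), a, (p2, q2)))
    return (set(product(sp, sq)), ap | aq, trans, (ip, iq))

P = ({0, 1}, {"a", "c"}, {(0, "a", 1), (1, "c", 0)}, 0)
Q = ({0, 1}, {"a", "b"}, {(0, "a", 1), (0, "b", 0)}, 0)
states, actions, transitions, init = parallel(P, Q, {"a"})
```

In this example `a` is the only synchronized action, so it can fire only from the product state `(0, 0)`, where both components offer it.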
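Definition 3 (Action mapping). Let $P$ be an LTS and $\rho : A_P \rightarrow 2^{\mathcal{A}}$ be a
total function. We write $\rho(A_P)$ for the image of $\rho$, defined by $\bigcup_{a \in A_P} \rho(a)$. We
write $\rho(P)$ for the LTS $(\Sigma_P, \rho(A_P), \rightarrow, init(P))$ where ${\rightarrow} = \{(p, a', p') \mid (\exists a \in
A_P)\ p \xrightarrow{a}_P p' \wedge a' \in \rho(a)\}$. An action mapping $\rho$ is admissible if $\tau \in A_P$ implies
$\rho(\tau) = \{\tau\}$. We distinguish the following admissible action mappings:


— $\rho$ is an action hiding if $(\exists A \subseteq \mathcal{A} \setminus \{\tau\})\ (\forall a \in A \cap A_P)\ \rho(a) = \{\tau\} \wedge (\forall a \in
A_P \setminus A)\ \rho(a) = \{a\}$. We write "hide $A$ in $P$" for $\rho(P)$.

— $\rho$ is an action cut if $(\exists A \subseteq \mathcal{A} \setminus \{\tau\})\ (\forall a \in A \cap A_P)\ \rho(a) = \emptyset \wedge (\forall a \in
A_P \setminus A)\ \rho(a) = \{a\}$. We write "cut $A$ in $P$" for $\rho(P)$.

— $\rho$ is an action renaming if $(\exists f : A_P \rightarrow \mathcal{A})\ (\forall a \in A_P)\ \rho(a) = \{f(a)\}$ and
$\tau \in A_P$ implies $f(\tau) = \tau$. We write "rename $f$ in $P$" for $\rho(P)$.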
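
Definition 3 and its three special cases can be sketched as follows. This is an illustrative Python fragment, not CADP code: a mapping is represented as a dictionary from actions to sets of actions, and all function names are hypothetical.

```python
TAU = "tau"  # assumed encoding of the invisible action

def apply_mapping(transitions, rho):
    """Transitions of rho(P): each label a is replaced by every a' in rho[a]."""
    return {(p, b, q) for (p, a, q) in transitions for b in rho[a]}

def hiding(actions, hidden):
    """Mapping for 'hide A in P': hidden actions become tau."""
    return {a: ({TAU} if a in hidden else {a}) for a in actions}

def cut(actions, removed):
    """Mapping for 'cut A in P': removed actions map to the empty set."""
    return {a: (set() if a in removed else {a}) for a in actions}

def renaming(actions, f):
    """Mapping for 'rename f in P'."""
    return {a: {f(a)} for a in actions}

def is_admissible(rho):
    """Admissibility: tau, if present, must be mapped to {tau}."""
    return TAU not in rho or rho[TAU] == {TAU}

trans = {(0, "a", 1), (1, "b", 2)}
acts = {"a", "b"}
hidden_trans = apply_mapping(trans, hiding(acts, {"a"}))
cut_trans = apply_mapping(trans, cut(acts, {"a"}))
renamed_trans = apply_mapping(trans, renaming(acts, lambda a: "c" if a == "a" else a))
```

A mapping may also send one action to several, as in the synchronization-vector example of the footnote: `{"a": {"b", "c"}, "b": {"b"}}` duplicates every `a`-transition into a `b`-transition and a `c`-transition.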


Parallel composition and action mapping subsume all abstraction and composition
operators encodable as networks of LTS [42, 18, 33], such as synchronization
vectors⁵ and the parallel composition, hiding, renaming, and cut operators
of CCS [38], CSP [8], mCRL [26], LOTOS [29], E-LOTOS [30], and LNT [9].


LTS can be compared and reduced modulo well-known bisimilarity relations, 
such as strong [40] and (div)branching [20, 23] bisimilarity. We do not give their 
definitions, which can easily be found elsewhere (e.g., [35]). They are special cases 
of Definition 7 (page 7), as shown by Theorem 1 (page 9). We write $\sim$ (resp.
$\sim_{dbr}$) for the strong (resp. divbranching) bisimilarity relation between states.
We write $min_{str}(P)$ (resp. $min_{dbr}(P)$) for the quotient of $P$ w.r.t. strong (resp.
divbranching) bisimilarity, i.e., the LTS obtained by replacing each state by its
equivalence class. The quotient is the smallest LTS of its equivalence class, thus 
computing the quotient is called minimization. Moreover, these bisimilarities are 
congruences for parallel composition and admissible action mapping. This allows 
reductions to be applied at any intermediate step during LTS construction, thus 
potentially reducing the overall cost. However, since processes may constrain 
each other by synchronization, composing LTS pairwise following the algebraic 
structure of the composition expression and applying reduction after each com- 
position can be orders of magnitude less efficient than other strategies in terms 
of the largest intermediate LTS. Finding an optimal strategy is impossible, as
it requires knowing the size of (the reachable part of) an LTS product without
actually computing the product. One generally relies on heuristics to select a
subset of LTS to compose at each step of LTS construction. In this paper, we 
will use the smart reduction heuristic [12, 18], which is implemented within the 
SVL [17] tool of CADP [19]. This heuristic tries to find an efficient composition 
order by analysing the synchronization and hiding structure of the composition. 


5 For instance, the composition of $P$ and $Q$ where action $a$ of $P$ synchronizes with
either $b$ or $c$ of $Q$, can be written as $\rho(P)\ |[b, c]|\ Q$, where $\rho$ maps $a$ onto $\{b, c\}$. This
example illustrates the utility of mapping actions into sets of actions of arbitrary size.
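3 Temporal Logics


Definition 4 (Modal $\mu$-calculus [31]). The modal $\mu$-calculus ($L_\mu$) is built
from action formulas $\alpha$ and state formulas $\varphi$, whose syntax and semantics w.r.t.
an LTS $P = (\Sigma, A, \rightarrow, p_{init})$ are defined as follows:

\[
\begin{array}{rcll}
\alpha & ::= & a & [\![a]\!]_A = \{a\} \\
 & \mid & \mathsf{false} & [\![\mathsf{false}]\!]_A = \emptyset \\
 & \mid & \alpha_1 \vee \alpha_2 & [\![\alpha_1 \vee \alpha_2]\!]_A = [\![\alpha_1]\!]_A \cup [\![\alpha_2]\!]_A \\
 & \mid & \neg\alpha_0 & [\![\neg\alpha_0]\!]_A = A \setminus [\![\alpha_0]\!]_A \\[4pt]
\varphi & ::= & \mathsf{false} & [\![\mathsf{false}]\!]_P\,\delta = \emptyset \\
 & \mid & \varphi_1 \vee \varphi_2 & [\![\varphi_1 \vee \varphi_2]\!]_P\,\delta = [\![\varphi_1]\!]_P\,\delta \cup [\![\varphi_2]\!]_P\,\delta \\
 & \mid & \neg\varphi_0 & [\![\neg\varphi_0]\!]_P\,\delta = \Sigma \setminus [\![\varphi_0]\!]_P\,\delta \\
 & \mid & \langle\alpha\rangle\varphi_0 & [\![\langle\alpha\rangle\varphi_0]\!]_P\,\delta = \{p \in \Sigma \mid \exists\, p \xrightarrow{a} p'.\ a \in [\![\alpha]\!]_A \wedge p' \in [\![\varphi_0]\!]_P\,\delta\} \\
 & \mid & X & [\![X]\!]_P\,\delta = \delta(X) \\
 & \mid & \mu X.\varphi_0 & [\![\mu X.\varphi_0]\!]_P\,\delta = \bigcup_{k \geq 0} \Phi_{0P,\delta}^{k}(\emptyset)
\end{array}
\]

where $X \in \mathcal{X}$ are propositional variables denoting sets of states, $\delta : \mathcal{X} \rightarrow 2^{\Sigma}$
is a context mapping propositional variables to sets of states, $[\,]$ is the empty
context, $\delta[U/X]$ is the context identical to $\delta$ except for variable $X$, which is
mapped to state set $U$, and the functional $\Phi_{0P,\delta} : 2^{\Sigma} \rightarrow 2^{\Sigma}$ associated to the
formula $\mu X.\varphi_0$ is defined as $\Phi_{0P,\delta}(U) = [\![\varphi_0]\!]_P\,\delta[U/X]$. For closed formulas, we
write $P \models \varphi$ (read $P$ satisfies $\varphi$) for $p_{init} \in [\![\varphi]\!]_P\,[\,]$.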


Action formulas $\alpha$ are built from actions and Boolean operators. State formulas
$\varphi$ are built from Boolean operators, the possibility modality $\langle\alpha\rangle\varphi_0$ denoting the
states with an outgoing transition labelled by an action satisfying $\alpha$ and leading
to a state satisfying $\varphi_0$, and the minimal fixed point operator $\mu X.\varphi_0$ denoting
the least solution of the equation $X = \varphi_0$ interpreted over $2^{\Sigma}$.
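
To make the semantics of Definition 4 concrete, the sketch below evaluates state formulas over a finite LTS, computing $\mu X.\varphi_0$ by iterating the associated functional from the empty set. It is a naive Python illustration under our own encoding: formulas are nested tuples, and action formulas are represented directly as Python predicates on actions rather than by their syntax. It assumes syntactically monotonic formulas, for which the iteration stabilizes.

```python
def sat(phi, states, trans, delta=None):
    """Set of states satisfying the state formula `phi` in context `delta`."""
    delta = delta or {}
    kind = phi[0]
    if kind == "false":
        return frozenset()
    if kind == "or":
        return sat(phi[1], states, trans, delta) | sat(phi[2], states, trans, delta)
    if kind == "not":
        return frozenset(states) - sat(phi[1], states, trans, delta)
    if kind == "dia":  # possibility modality <alpha> phi0
        alpha, target = phi[1], sat(phi[2], states, trans, delta)
        return frozenset(p for (p, a, q) in trans if alpha(a) and q in target)
    if kind == "var":
        return delta[phi[1]]
    if kind == "mu":  # least fixed point, by iteration from the empty set
        x, body, current = phi[1], phi[2], frozenset()
        while True:
            nxt = sat(body, states, trans, {**delta, x: current})
            if nxt == current:
                return current
            current = nxt
    raise ValueError(f"unknown formula: {phi!r}")

TRUE = ("not", ("false",))
states = {0, 1, 2}
trans = {(0, "tau", 1), (1, "a", 2)}
# mu X. <a> true  or  <tau> X : an a-transition is reachable via tau steps
reach_a = ("mu", "X", ("or", ("dia", lambda a: a == "a", TRUE),
                             ("dia", lambda a: a == "tau", ("var", "X"))))
result = sat(reach_a, states, trans)
```

On this three-state LTS the iteration yields first `{1}`, then `{0, 1}`, which is the fixed point.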

The usual derived operators are defined as follows: Boolean connectors $\mathsf{true} =
\neg\mathsf{false}$ and $\varphi_1 \wedge \varphi_2 = \neg(\neg\varphi_1 \vee \neg\varphi_2)$; necessity modality $[\alpha]\varphi_0 = \neg\langle\alpha\rangle\neg\varphi_0$; and
maximal fixed point operator $\nu X.\varphi_0 = \neg\mu X.\neg\varphi_0[\neg X/X]$, where $\varphi_0[\neg X/X]$ is
the syntactic substitution of $X$ by $\neg X$ in $\varphi_0$. Syntactically, $\langle\,\rangle$ and $[\,]$ have the
highest precedence, followed by $\wedge$, then $\vee$, and finally $\mu$ and $\nu$. To have a
well-defined semantics, state formulas are syntactically monotonic [31], i.e., in every
subformula $\mu X.\varphi_0$, all occurrences of $X$ in $\varphi_0$ fall in the scope of an even number
of negations. Thus, negations can be eliminated by downward propagation. We
now introduce the weak modalities of the fragment $L_\mu^{dbr}$ proposed in [37].


Definition 5 (Modalities of $L_\mu^{dbr}$ [37]). We write $\alpha_\tau$ for an action formula
such that $\tau \in [\![\alpha_\tau]\!]_A$ and $\alpha_a$ for an action formula such that $\tau \notin [\![\alpha_a]\!]_A$. We
consider the following modalities, their $L_\mu$ semantics, and their informal semantics:


modality name           notation                                                                  $L_\mu$ semantics
ultra-weak              $\langle(\varphi_1?.\alpha_\tau)^*\rangle\,\varphi_2$                      $\mu X.\varphi_2 \vee (\varphi_1 \wedge \langle\alpha_\tau\rangle X)$
weak                    $\langle(\varphi_1?.\alpha_\tau)^*.\varphi_1?.\alpha_a\rangle\,\varphi_2$  $\mu X.\varphi_1 \wedge (\langle\alpha_a\rangle\varphi_2 \vee \langle\alpha_\tau\rangle X)$
weak infinite looping   $\langle\varphi_1?.\alpha_\tau\rangle\,@$                                  $\nu X.\varphi_1 \wedge \langle\alpha_\tau\rangle X$

Ultra-weak: $p$ is source of a path whose transition labels satisfy $\alpha_\tau$, leading to
a state that satisfies $\varphi_2$, while traversing only states that satisfy $\varphi_1$.

Weak: $p$ is source of a path whose transition labels satisfy $\alpha_\tau$, leading to a state
that satisfies $\varphi_1$ and $\langle\alpha_a\rangle\varphi_2$, while traversing only states that satisfy $\varphi_1$.

Weak infinite looping: $p$ is source of an infinite path whose transition labels
satisfy $\alpha_\tau$, while traversing only states that satisfy $\varphi_1$.


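We also consider the three dual modalities $[(\varphi_1?.\alpha_\tau)^*]\,\varphi_2 = \neg\langle(\varphi_1?.\alpha_\tau)^*\rangle\,\neg\varphi_2$,
$[(\varphi_1?.\alpha_\tau)^*.\varphi_1?.\alpha_a]\,\varphi_2 = \neg\langle(\varphi_1?.\alpha_\tau)^*.\varphi_1?.\alpha_a\rangle\,\neg\varphi_2$, and $[\varphi_1?.\alpha_\tau] \dashv\ = \neg\langle\varphi_1?.\alpha_\tau\rangle\,@$.
The fragment $L_\mu^{dbr}$ adequate with divbranching bisimilarity consists of $L_\mu$ from
which the modalities $\langle\alpha\rangle\varphi$ and $[\alpha]\varphi$ are replaced by the ultra-weak, weak, and
weak infinite looping modalities defined above.

We identify fragments of $L_\mu$ parameterized by a set of strong actions $A_s$, as
the set of state formulas whose action formulas contained in strong modalities
satisfy only actions of $A_s$.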


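Definition 6 ($L_\mu^{strong}(A_s)$ fragment of $L_\mu$ [35]). Let $A_s \subseteq \mathcal{A}$ be a set of
actions called strong actions and $\alpha_s$ be any action formula such that $[\![\alpha_s]\!]_A \subseteq A_s$,
called a strong action formula. $L_\mu^{strong}(A_s)$ is defined as the set of formulas
semantically equivalent to some formula of the following language:

\[
\begin{array}{rcl}
\varphi & ::= & \mathsf{false} \mid \varphi_1 \vee \varphi_2 \mid \neg\varphi_0 \mid \langle\alpha_s\rangle\varphi_0 \mid X \mid \mu X.\varphi_0 \\
 & \mid & \langle(\varphi_1?.\alpha_\tau)^*\rangle\,\varphi_2 \mid \langle(\varphi_1?.\alpha_\tau)^*.\varphi_1?.\alpha_a\rangle\,\varphi_2 \mid \langle\varphi_1?.\alpha_\tau\rangle\,@
\end{array}
\]

In the context of $L_\mu^{strong}(A_s)$, we call $\langle\alpha_s\rangle\varphi_0$ a strong modality.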


In [35], we also provide guidelines for extracting a set $A_s$ from particular
$L_\mu$ formulas encoding the operators of widely-used temporal logics, such as
CTL [11], ACTL [39], PDL [15], and PDL-Δ [44].


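Example 1. The PDL formula $[\mathsf{true}^*.a_1.a_2]\,\mathsf{true}$ belongs to $L_\mu^{strong}(\{a_2\})$ as it
is semantically equivalent to $[(\mathsf{true}?.\mathsf{true})^*.\mathsf{true}?.a_1]\,[a_2]\,\mathsf{true}$. The CTL formula
$\mathsf{EF}(\langle a_1\rangle\mathsf{true} \wedge \langle a_2\rangle\mathsf{true})$ belongs both to $L_\mu^{strong}(\{a_1\})$ as it is semantically
equivalent to $\langle(\mathsf{true}?.\mathsf{true})^*\rangle\,\langle(\langle a_1\rangle\mathsf{true}?.\mathsf{true})^*.\langle a_1\rangle\mathsf{true}?.a_2\rangle\,\mathsf{true}$ and to $L_\mu^{strong}(\{a_2\})$
as it is semantically equivalent to the same formula where $a_1$ and $a_2$ are swapped.
These formulas do not belong to $L_\mu^{strong}(\emptyset)$. (This was shown in [35].)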


The latter example shows that several minimal sets of strong actions $A_s$ may
correspond to a formula $\varphi$. Indeed, either the $\langle a_1\rangle\mathsf{true}$ or the $\langle a_2\rangle\mathsf{true}$ modality
can be made part of a weak modality, but not both in the same formula.


4 Sharp Bisimilarity 


We define the family of sharp bisimilarity relations below. Each relation is hy- 
brid between strong and divbranching bisimilarities, parameterized by the set 
of strong actions, such that the conditions of strong bisimilarity apply to strong 
actions and the conditions of divbranching bisimilarity apply to all other actions. 


6 For generality we allow $\tau \in A_s$, to enable strong modalities of the form $\langle\alpha_\tau\rangle\varphi_0$.
Definition 7 (Sharp bisimilarity). A divergence-unpreserving sharp bisimulation
w.r.t. a set of actions $A_s$ is a symmetric relation $R \subseteq \Sigma \times \Sigma$ such that if
$(p,q) \in R$ then for all $p \xrightarrow{a} p'$, there exists $q'$ such that $(p',q') \in R$ and either of
the following holds: (1) $q \xrightarrow{a} q'$, or (2) $a = \tau$, $\tau \notin A_s$, and $q' = q$, or (3) $a \notin A_s$,
and there exists a sequence of transitions $q_0 \xrightarrow{\tau} \ldots \xrightarrow{\tau} q_n \xrightarrow{a} q'$ ($n \geq 0$) such
that $q_0 = q$, and for all $i \in 1..n$, $(p, q_i) \in R$.⁷ A sharp bisimulation $R$ additionally
satisfies the following divergence-preservation condition: for all $(p_0, q_0) \in R$ such
that $p_0 \xrightarrow{\tau} p_1 \xrightarrow{\tau} p_2 \xrightarrow{\tau} \ldots$ with $(p_i, q_0) \in R$ for all $i \geq 0$, there is also an
infinite sequence $q_0 \xrightarrow{\tau} q_1 \xrightarrow{\tau} q_2 \xrightarrow{\tau} \ldots$ such that $(p_i, q_j) \in R$ for all $i, j \geq 0$.
Two states $p$ and $q$ are sharp bisimilar w.r.t. $A_s$, written $p \sim_{\sharp A_s} q$, if and only
if there exists a sharp bisimulation $R$ w.r.t. $A_s$ such that $(p,q) \in R$.
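
The three cases of Definition 7 can be checked mechanically for a candidate relation. The sketch below, in Python, tests only the divergence-unpreserving part of the definition (the divergence condition is not checked) and works on the symmetric closure of the given relation. The function name and the encoding of states, labels, and $\tau$ are our assumptions; the two LTS are assumed to share one transition set with disjoint state names.

```python
TAU = "tau"  # assumed encoding of the invisible action

def is_du_sharp_bisimulation(rel, trans, strong):
    """Check cases (1)-(3) of Definition 7 on the symmetric closure of rel."""
    R = set(rel) | {(q, p) for (p, q) in rel}
    succ = {}
    for (p, a, q) in trans:
        succ.setdefault(p, []).append((a, q))

    def matched(p, a, p2, q):
        # Case (1): q answers with the same action.
        if any(b == a and (p2, q2) in R for (b, q2) in succ.get(q, [])):
            return True
        # Case (2): a weak tau step is answered by staying in q.
        if a == TAU and TAU not in strong and (p2, q) in R:
            return True
        # Case (3): only for weak actions; q may first follow tau steps
        # through states that all remain related to p.
        if a in strong:
            return False
        seen, stack = {q}, [q]
        while stack:
            qi = stack.pop()
            for (b, qn) in succ.get(qi, []):
                if b == a and (p2, qn) in R:
                    return True
                if b == TAU and qn not in seen and (p, qn) in R:
                    seen.add(qn)
                    stack.append(qn)
        return False

    return all(matched(p, a, p2, q)
               for (p, q) in R for (a, p2) in succ.get(p, []))

# p0 -tau-> p1 -a-> p2 compared with q0 -a-> q1
trans = {("p0", TAU, "p1"), ("p1", "a", "p2"), ("q0", "a", "q1")}
rel = {("p0", "q0"), ("p1", "q0"), ("p2", "q1")}
weak_case = is_du_sharp_bisimulation(rel, trans, strong=set())
strong_case = is_du_sharp_bisimulation(rel, trans, strong={"a"})
```

When `a` is weak the relation passes, since the `a`-step of `q0` is matched by `p0` after a `tau` step (case 3). When `a` is strong the same relation is rejected, because case (3) no longer applies and `p0` has no direct `a`-transition.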


Similarly to strong, branching, and divbranching bisimilarities, sharp bisimi- 
larity is an equivalence relation as it is the union of all sharp bisimulations. The 
quotient of an LTS P w.r.t. sharp bisimilarity is unique and minimal both in 
number of states and number of transitions. 


Example 2. Let $a, b, \omega \in \mathcal{A} \setminus \{\tau\}$, $\tau, \omega \notin A_s$. LTS $P_i$ and $P_i'$ of Figure 1 satisfy
$P_i \sim_{\sharp A_s} P_i'$ ($i \in 1..7$). We give the smallest relation between $P_i$ and $P_i'$, whose
symmetric closure is a sharp bisimulation w.r.t. $A_s$ and the weakest condition
for $P_i'$ to be minimal. Unlike divbranching, states on the same $\tau$-cycle are not

necessarily sharp bisimilar: in P7, if a € A, then pọ and p are not sharp bisimilar. 


Example 3. The LTS of Figure 2(a) is equivalent for $\sim_{\sharp\{a\}}$ to the one of
Figure 2(b), which is minimal. We see that sharp bisimilarity reduces more than
strong bisimilarity when at least one action (visible or invisible) is weak. Here, $\tau$
is the only weak action and the minimized LTS is smaller than the one minimal
for strong bisimilarity (only $p_1$ and $p_2$ are strongly bisimilar).


If $\tau \in A_s$, then case (2) of Definition 7 cannot apply, i.e., $\tau$-transitions cannot
be totally suppressed. As a consequence, looking at case (3), if $\tau$-transitions are
present in state $q_0$ then, due to symmetry, they must have a counterpart in
state $p$. As a result, finite sequences of $\tau$-transitions are preserved. Sharp may
however differ from strong bisimilarity in the possibility to compress circuits of
$\tau$-transitions that would remain unreduced, as illustrated in Example 4 below.


Example 4. If $\tau \in A_s$ and $a \notin A_s$, then the LTS of Figure 2(b) (which is minimal
for strong bisimilarity) can be reduced to the LTS of Figure 2(c).


The next theorems are new. Theorem 1 expresses that sharp bisimilarity w.r.t. a
set of strong actions $A_s$ is strictly stronger than w.r.t. any set of strong actions
strictly included in $A_s$. Unsurprisingly, it also establishes that sharp coincides
with divbranching when the set of strong actions is empty, and with strong when

7 We require that $(p, q_i) \in R$ for all $i \in 1..n$ and not the simpler condition $(p, q_n) \in R$
(as usual when defining branching bisimulation) because sharp bisimulation does not have
the nice property that $(p, q_0) \in R$ and $(p, q_n) \in R$ imply $(p, q_i) \in R$ for all $i \in 1..n$.
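[Figure 1: diagrams of the LTS $P_1, \ldots, P_7$ and $P_1', \ldots, P_7'$, where
$a \neq \tau \wedge b \neq \tau \wedge \omega \neq \tau \wedge \tau \notin A_s \wedge \omega \notin A_s$ implies $P_i \sim_{\sharp A_s} P_i'$. For each $i$, the
relation between $P_i$ and $P_i'$ and the condition for $P_i'$ to be minimal are:]

$P_1$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0'), (p_3, p_0')\}$; $P_1'$ minimal
$P_2$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0'), (p_3, p_3')\}$; $a \in A_s$ implies $P_2'$ minimal
$P_3$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0')\}$; $a \neq \omega$ implies $P_3'$ minimal
$P_4$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0'), (p_3, p_0')\}$; $a \neq \omega$ implies $P_4'$ minimal
$P_5$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0'), (p_3, p_3')\}$; $b \neq a \wedge b \in A_s$ implies $P_5'$ minimal
$P_6$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_0'), (p_3, p_0')\}$; $P_6'$ minimal
$P_7$: $\{(p_0, p_0'), (p_1, p_1'), (p_2, p_2'), (p_3, p_0')\}$; $a \in A_s$ implies $P_7'$ minimal

Fig. 1. Examples of sharp bisimilar LTS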
[Figure 2: three LTS, panels (a), (b), and (c).]
Fig. 2. LTS of Examples 3 and 4


it comprises all actions (including $\tau$). It follows that the set of sharp bisimilarity
relations equipped with set inclusion forms a complete lattice whose supremum 
is divbranching bisimilarity and whose infimum is strong bisimilarity. 


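Theorem 1. (1) $\sim_{\sharp\emptyset}\ =\ \sim_{dbr}$; (2) $\sim_{\sharp\mathcal{A}}\ =\ \sim$; (3) if $A_s' \subset A_s$ then $\sim_{\sharp A_s}\ \subset\ \sim_{\sharp A_s'}$.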


Theorem 2 expresses that sharp bisimilarity w.r.t. $A_s$ preserves the truth
value of all formulas of $L_\mu^{strong}(A_s)$, and Theorem 3 that two LTS verifying
exactly the same formulas of $L_\mu^{strong}(A_s)$ are sharp bisimilar. We can then deduce
that $L_\mu^{strong}(A_s)$ is adequate with $\sim_{\sharp A_s}$, as expressed by Corollary 1.


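Theorem 2. If $P \sim_{\sharp A_s} P'$ and $\varphi \in L_\mu^{strong}(A_s)$ then $P \models \varphi$ iff $P' \models \varphi$.

Theorem 3. If $(\forall\varphi \in L_\mu^{strong}(A_s))\ P \models \varphi$ iff $Q \models \varphi$, then $P \sim_{\sharp A_s} Q$.


Corollary 1. $L_\mu^{strong}(A_s)$ is adequate with $\sim_{\sharp A_s}$, i.e., $P \sim_{\sharp A_s} P'$ if and only if
$(\forall\varphi \in L_\mu^{strong}(A_s))\ P \models \varphi$ iff $P' \models \varphi$.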


Theorems 4 and 5 express that sharp bisimilarity is a congruence for parallel 
composition and admissible action mapping. It follows that it is also a congruence 
for hide, cut, and rename, as expressed by Corollary 2. 


Theorem 4. If $P \sim_{\sharp A_s} P'$, $Q \sim_{\sharp A_s} Q'$ then $P\ |[A_{sync}]|\ Q \sim_{\sharp A_s} P'\ |[A_{sync}]|\ Q'$.


Theorem 5. If $\rho$ is admissible and $P \sim_{\sharp A_s} P'$, then $\rho(P) \sim_{\sharp A_s'} \rho(P')$, where
$A_s' = \rho(A_s) \setminus \rho(A_P \setminus A_s)$.


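Corollary 2. We write $A_\tau$ for $A \cup \{\tau\}$. If $P \sim_{\sharp A_s} P'$ then:


— cut $A$ in $P$ $\sim_{\sharp A_s}$ cut $A$ in $P'$
— hide $A$ in $P$ $\sim_{\sharp A_s}$ hide $A$ in $P'$ if $A_\tau \subseteq A_s \vee A_\tau \cap A_s = \emptyset$
— rename $f$ in $P$ $\sim_{\sharp A_s}$ rename $f$ in $P'$ if $f(A_s) \subseteq A_s \wedge f(A_P \setminus A_s) \cap A_s = \emptyset$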


These theorems and corollaries generalize results on strong and divbranching
bisimilarity. In particular, the side conditions of Corollary 2 are always true when
$A_s = \emptyset$ (divbranching) or $A_s = \mathcal{A}$ (strong).

Since every admissible network of LTS can be translated into an equivalent 
composition expression consisting of parallel compositions and admissible action 
mappings, Theorems 4 and 5 imply some congruence property at the level of
networks of LTS. However, one must be careful on how the synchronization 
rules preserve or modify the set of strong actions of components. 

In the sequel, we establish formally the relationship between sharp bisimilarity
and sharp $\tau$-confluence, a strong form of $\tau$-confluence [27] defined below in a
way analogous to strong $\tau$-confluence in [28]. It is known that every $\tau$-transition
that is $\tau$-confluent is inert for branching bisimilarity, i.e., its source and target
states are branching bisimilar. There are situations where $\tau$-confluence can be
detected locally, thus enabling on-the-fly LTS reductions. We present an analogous
result that might have similar applications, namely, every $\tau$-transition that
is sharp $\tau$-confluent is inert for (divergence-unpreserving) sharp bisimilarity.


Definition 8 (Sharp $\tau$-confluence). Let $P = (\Sigma, A, \rightarrow, p_{init})$ and $T \subseteq\ \xrightarrow{\tau}$
be a set of internal transitions. $T$ is sharp $\tau$-confluent w.r.t. a set $A_s$ of strong
actions if $\tau \notin A_s$ and for all $(p_0, \tau, p_1) \in T$ and $a \in A$: (1) $p_0 \xrightarrow{a} p_2$
implies either $p_1 \xrightarrow{a} p_2$ or there exists $p_3$ such that $p_1 \xrightarrow{a} p_3$ and $(p_2, \tau, p_3) \in T$,
and (2) if $a \in A_s$ then $p_1 \xrightarrow{a} p_3$ implies either $p_0 \xrightarrow{a} p_3$ or there exists $p_2$ such
that $p_0 \xrightarrow{a} p_2$ and $(p_2, \tau, p_3) \in T$. A transition $p_0 \xrightarrow{\tau} p_1$ is sharp $\tau$-confluent
w.r.t. $A_s$ if there is a set of transitions $T$ that is sharp $\tau$-confluent w.r.t. $A_s$ and
such that $(p_0, \tau, p_1) \in T$.


The difference between strong $\tau$-confluence and sharp $\tau$-confluence is the
addition of condition (2), which can be removed to obtain the very same definition
of strong $\tau$-confluence as [28]. Strong $\tau$-confluence thus coincides with sharp
$\tau$-confluence w.r.t. the empty set of actions. Sharp $\tau$-confluence not only requires
that other transitions of the source state of a confluent transition also exist in
the target state, but also that the converse is true for strong actions.

If a transition is sharp $\tau$-confluent w.r.t. $A_s$, then it is also sharp $\tau$-confluent
w.r.t. any subset of $A_s$. In particular, sharp $\tau$-confluence is stronger than strong
$\tau$-confluence (which is itself stronger than $\tau$-confluence). Theorem 6 formalizes
the relationship between sharp $\tau$-confluence and divergence-unpreserving sharp
bisimilarity. This result could be lifted to sharp bisimilarity by adding a condition
on divergence in the definition of sharp $\tau$-confluence.


Theorem 6. If $\tau \notin A_s$ and $p_0 \xrightarrow{\tau}_P p_1$ is sharp $\tau$-confluent w.r.t. $A_s$, then $p_0$
and $p_1$ are divergence-unpreserving sharp bisimilar w.r.t. $A_s$.


Theorem 6 illustrates a form of reduction that one can expect using sharp
bisimilarity when $\tau \notin A_s$, namely compression of diamonds of sharp $\tau$-confluent
transitions, which are usually generated by parallel composition. The strongest
form of sharp $\tau$-confluence (which could be called ultra-strong $\tau$-confluence) is
when all visible actions are strong. In that case, every visible action present in
the source state must be also present in the target state, and conversely. The
source and target states are then sharp bisimilar w.r.t. the set of visible actions.
Yet, it is interesting to note that they are not necessarily strongly bisimilar,
sharp bisimilarity w.r.t. all visible actions being weaker than strong bisimilarity.
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There exist weaker forms of 7-confluence [27,50], which accept that choices 
between r-confluent and other transitions are closed by arbitrary sequences of 
T-confluent transitions rather than sequences of length 0 or 1. It could be in- 
teresting to investigate how the definition of sharp r-confluence could also be 
weakened, while preserving inertness for sharp bisimilarity. 


5 LTS Reduction 


The interest of sharp bisimilarity in the context of compositional verification is
the ability to replace components by smaller but still equivalent ones, as allowed 
by the congruence property. To do so, we need a procedure that enables such a 
reduction. This is what we address in this section. 

A procedure to reduce an LTS $P$ for sharp bisimilarity is proposed as follows: (1) Build $P'$, consisting of $P$ in which all $\tau$-transitions that immediately precede a transition labelled by a strong action (or all $\tau$-transitions if $\tau$ is itself a strong action) are renamed into a special visible action $\kappa \in A \setminus A_P$; (2) Minimize $P'$ for divbranching bisimilarity; (3) Hide in the resulting LTS all occurrences of $\kappa$. The renaming of $\tau$-transitions into $\kappa$ allows them to be considered temporarily as visible transitions, so that they are not eliminated by divbranching minimization.⁸ This algorithm is now defined formally.


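Definition 9. Let $P$ be an LTS and $A_s$ be a set of strong actions. Let $\kappa \in A \setminus A_P$ be a special visible action. We write $\mathit{red}_{A_s}(P)$ for the reduction of $P$ defined as the LTS "hide $\kappa$ in $\mathit{min}_{dbr}(P')$", where $P' = (\Sigma_P, A_P \cup \{\kappa\}, \longrightarrow, \mathit{init}(P))$ and $\longrightarrow$ is defined as follows:


$$\longrightarrow\ =\ \{(p, \kappa, p') \mid p \xrightarrow{a}_P p' \wedge \kappa_{A_s}(a, p')\}\ \cup\ \{(p, a, p') \mid p \xrightarrow{a}_P p' \wedge \neg\kappa_{A_s}(a, p')\}$$

where $\kappa_{A_s}(a, p') = \big((a = \tau) \wedge (\tau \in A_s \vee \exists a' \in A_s,\, p'' : p' \xrightarrow{a'}_P p'')\big)$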
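
Step (1) of the procedure, the renaming that turns $P$ into $P'$, can be sketched in a few lines of Python. This is a minimal illustration only: the labels `"tau"` and `"kappa"` and the function name are hypothetical, transitions are plain `(source, action, target)` triples, and the minimization and hiding steps (2) and (3) are deliberately left out.

```python
TAU = "tau"      # hypothetical name for the internal action
KAPPA = "kappa"  # hypothetical name for the fresh visible action


def rename_tau_to_kappa(transitions, strong):
    """Build the transition relation of P' from that of P (step 1 only)."""
    # States that can perform at least one strong action.
    enables_strong = {src for (src, a, _) in transitions if a in strong}
    renamed = set()
    for (p, a, q) in transitions:
        # A tau-transition is kept uncompressed if tau is itself strong
        # or if its target state enables a strong action.
        keep_uncompressed = a == TAU and (TAU in strong or q in enables_strong)
        renamed.add((p, KAPPA if keep_uncompressed else a, q))
    return renamed


lts = {(0, TAU, 1), (1, "a", 2), (2, TAU, 3), (3, "b", 4)}
```

With `strong = {"a"}`, only the first $\tau$-transition of `lts` is renamed, since only state 1 enables a strong action; with `strong = {"tau"}`, both $\tau$-transitions are renamed.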


It is clear that $\mathit{red}_{A_s}(P)$ is a reduction, i.e., it cannot have more states and transitions than $P$. Since the complexities of the transformation from $P$ to $P'$ and of hiding $\kappa$ are at worst linear in $|P|_{tr}$, the complexity of the whole algorithm is dominated by divbranching minimization, for which there exists an algorithm⁹ of worst-case complexity $O(m \log n)$, where $m = |P|_{tr}$ and $n = |P|_{st}$ [25].

As regards correctness, Theorem 7 states that $\mathit{red}_{A_s}(P)$ is indeed sharp bisimilar to $P$. Theorem 8 indicates that the reduction coincides with divbranching minimization if the LTS does not contain any strong action, with strong minimization if $\tau$ is a strong action or if the LTS does not contain $\tau$, and that the resulting LTS has a size that lies in between the size of the minimal LTS for divbranching bisimilarity and the size of the minimal LTS for strong bisimilarity.


Theorem 7. For any LTS $P$, we have $P \sim_{\sharp A_s} \mathit{red}_{A_s}(P)$.


8 The letter $\kappa$ stands for keep uncompressed.
9 Strictly speaking, the algorithm of [25] implements branching minimization but, as noted by its authors, handling divergences requires only a minor adaptation.

Theorem 8. The following hold for any LTS $P$: (1) if $A_P \cap A_s = \emptyset$ then $\mathit{red}_{A_s}(P) = \mathit{min}_{dbr}(P)$, (2) if $\tau \notin A_P \setminus A_s$ then $\mathit{red}_{A_s}(P) = \mathit{min}_{str}(P)$, and (3) $|\mathit{min}_{dbr}(P)|_{st} \le |\mathit{red}_{A_s}(P)|_{st} \le |\mathit{min}_{str}(P)|_{st} \wedge |\mathit{min}_{dbr}(P)|_{tr} \le |\mathit{red}_{A_s}(P)|_{tr} \le |\mathit{min}_{str}(P)|_{tr}$.


Although sharp reduction is effective in practice, as will be illustrated in the next section, it may fail to compress $\tau$-transitions that are inert for sharp bisimilarity, as the following examples show.


Example 5. Consider the LTS of Figure 2(a) (page 9). Its reduction using the 
above algorithm consists of the three steps depicted below: 


[Diagram: the LTS obtained after each of the three steps: 1. $\tau$-to-$\kappa$ renaming; 2. divbranching minimization; 3. $\kappa$-hiding.]


The reduced LTS (obtained at step 3) has one more state and two more transitions than the minimal LTS shown in Figure 2(b). Even though all visible actions are strong, our reduction compresses more than strong bisimilarity (recall that the minimal LTS for strong bisimilarity has 7 states and 8 transitions). In general, our reduction reduces more than strong bisimilarity¹⁰ as soon as $\tau \notin A_s$ (which is the case for most formulas in practice).


Example 6. In Figure 1 (page 8), if a € A, then red A, (P1) = Pi, red A, (P5) = P5, 
and red4,(Ps) = P$, i.e., reduction yields the minimal LTS. Yet, red A, (P3) = 
Ps # Pj, i.e., the sharp r-confluent transition po —9 p, p2 is not compressed. 
Similarly, Py, P5, and P; are not minimized using red 4, . 


Devising a minimization algorithm for sharp bisimilarity is left for future 
work. It could combine elements of existing partition-refinement algorithms for 
strong and divbranching minimizations, but the following difficulty must be 
taken into account (basic knowledge about partition-refinement is assumed): 


— A sequence of $\tau$-transitions is inert w.r.t. the current state partition if its source, target, and intermediate states are all in the same block. To refine a partition for sharp bisimilarity, one must be able to compute efficiently the set of non-inert transitions labelled by weak actions and reachable after an arbitrary sequence of inert transitions. The potential presence of inert cycles has to be considered carefully to avoid useless computations.
— In the case of divbranching bisimilarity, every $\tau$-cycle is inert and can thus be compressed into a single state. This is usually done initially, using Tarjan's algorithm for finding strongly connected components, whose complexity is linear in the LTS size. This guarantees the absence of inert cycles (except self $\tau$-loops) all along the subsequent partition-refinement steps. However, $\tau$-cycles are not necessarily inert for sharp bisimilarity, as illustrated by LTS Pt in Figure 1 (page 8). Therefore, $\tau$-cycles cannot be compressed initially. Instead, a cycle inert w.r.t. the current partition may be split into several sub-blocks during a refinement step. To know whether the sub-blocks still contain inert cycles, Tarjan's algorithm may have to be applied again.


10 The result of reduction is necessarily strong-bisimulation minimal, because if a transition $p \xrightarrow{\tau} p'$ is renamed into $\kappa$, then it is also the case of a $\tau$-transition in every state bisimilar to $p$, which remains bisimilar after the renaming. In addition, the subsequent divbranching minimization step necessarily merges strongly bisimilar states.


Although $\mathit{red}_{A_s}$ is not a minimization, we will see that it performs very well when used in a compositional setting. The reason is that (1) only a few of the system actions are strong, which limits the number of $\tau$-transitions renamed to $\kappa$, and (2) sharp $\tau$-confluent transitions most often originate from the interleaving of $\tau$-transitions that are inert in the components of parallel composition. The above reduction algorithm removes most inert transitions in individual (sequential) LTS, thus limiting the number of sharp $\tau$-confluent transitions in intermediate LTS. Still, better reductions can be expected with a full minimization algorithm, which will compress all $\tau$-transitions that are inert for sharp bisimilarity.


6 Experimentation 


We experimented with sharp reduction on the examples presented in [35] (consisting
of formulas containing both weak and strong modalities), namely the TFTP 
(Trivial File Transfer Protocol) and the CTL verification problems on parallel 
systems of the RERS 2018 challenge. For lack of space, see [35] for more details 
about these case studies. In both cases, we composed parallel processes in the 
same order as we did using the combined bisimulations approach, but using sharp 
bisimilarity instead of strong or divbranching bisimilarity to reduce processes. 
Experiments were done on a 3GHz/12GB RAM/8-core Intel Xeon computer 
running Linux, using the specification languages and 32-bit versions of tools 
provided in the CADP toolbox version 2019-d “Pisa” [19]. 

The results are given in Figures 3 (TFTP) and 4 (RERS 2018), both in 
terms of the size of the largest intermediate LTS, the size of the final LTS (LTS 
obtained after the last reduction step, on which the formula is checked), memory 
consumption, and time. Each subfigure contains three curves corresponding to 
the mono-bisimulation approach (using strong bisimulation to reduce all LTS), 
the combined bisimulations approach, and the sharp bisimulation approach. The 
former two curves are made from data that were already presented in [35]. Note 
that the vertical axis of all subfigures is on a logarithmic scale. In the RERS 2018 
case, the mono-bisimulation approach gives results only for experiments 101#22 and 101#23, all other experiments failing due to state space explosion.¹¹


11 E.g., smart mono-bisimulation fails on problem 103#23 after generating an intermediate LTS with more than 4.5 billion states and 36 billion transitions (instead of 50,301 states and 334,530 transitions using sharp bisimulation) using Grid'5000 [6].


70 F. Lang et al. 


[Figure 3: four panels (largest LTS size in Kstates, final LTS size in Kstates, memory peak, and verification time), each on a logarithmic vertical axis, plotted against the experiment number (1 to 31) for mono-bisimulation, combined bisimulations, and sharp bisimulation.]

Fig. 3. Experimental results of the TFTP case-study


[Figure 4: four panels (largest LTS size, final LTS size, memory peak, and verification time), each on a logarithmic vertical axis, plotted against the experiment number (101#22, 101#23, 102#22, 102#23, 103#21, 103#22, 103#23) for mono-bisimulation, combined bisimulations, and sharp bisimulation.]

Fig. 4. Experimental results of the RERS 2018 case-study


Sharp Congruences for Logics Combining Weak and Strong Modalities 71 


These results show that sharp bisimilarity incurs much more LTS reduction 
than the combined bisimulations approach, by a factor close to the one obtained 
when switching from the mono-bisimulation approach to the combined bisimu- 
lations approach. However, in the case of the RERS 2018 examples, this gain 
on LTS size does not always apply to time and/or memory consumption in the 
same proportions, except for experiment 103#22. This suggests that our implementation of minimization could be improved.

These experiments were conducted after the closing of the RERS 2018 challenge. Encouraged by the good results obtained with these two approaches, we participated in the 2019 edition¹², where 180 CTL problems were proposed instead of 9 in 2018. The models on which the properties had to be verified have from 8 to 70 parallel processes and from 29 to 234 actions. Although the models had been given in a wealth of different input formats (communicating automata, Petri nets in PNML format with NUPN information [16], and Promela) suitable for a large number of model checking tools, no other team than ours participated in the parallel challenges. This is a significant difference with 2018, when the challenge was easier, allowing three teams (with different tools) to participate.

We applied smart sharp reduction to these problems, using a prototype program that extracts strong actions automatically from (a restricted set of) CTL formulas used in the competition.¹³ This allowed the 180 properties to be checked automatically in less than 2.5 hours (CPU time), and using about 200 MB of RAM only, whereas using strong reduction failed on most of the largest problems. The largest intermediate graph obtained for the whole set of problems has 3364 states. All results were correct and we won all gold medals¹⁴ in this category. Details are available in the Zenodo archive mentioned in the introduction.


7 Related Work 


The paper [48] defines on doubly-labelled transition systems (a mix between Kripke structures and LTS) a family of bisimilarity relations derived from divbranching bisimilarity, parameterized by a natural number $n$, which preserves CTL* formulas whose nesting of next operators is smaller than or equal to $n$. Similar to our work,
they show that this family of relations (which is distinct from sharp bisimilarity 
in that there is no distinction between weak and strong actions) fills the gap 
between strong and divbranching bisimilarities. They apply their bisimilarity 
relation to slicing rather than compositional verification. 

The paper [2] proposes that, if the formula contains only so-called selective modalities, of the form $\langle(\neg\alpha_1)^*.\alpha_2\rangle\,\varphi_0$, then all actions but those satisfying $\alpha_1$ or $\alpha_2$ can be hidden, and the resulting system can be reduced for $\tau^*.a$-equivalence [14]. Yet, there exist formulas whose strong modalities $\langle a\rangle\,\varphi_0$ cannot translate into anything but the selective modality $\langle(\neg\mathit{true})^*.a\rangle\,\varphi_0$, meaning that no action at all can be hidden. In this case, $\tau^*.a$-equivalence coincides with strong bisimilarity and thus incurs much less reduction than sharp bisimilarity. Moreover, it is well-known that $\tau^*.a$-equivalence is not a congruence for parallel composition [7], which makes it unsuitable for compositional verification, even to check formulas that contain weak modalities only.


12 http://rers-challenge.org/2019

13 The paper [35] presents identities that were used to extract such strong actions.

14 A RERS gold medal is not a ranking but an achievement, not weakened by the low number of competitors. We also won all gold medals in the "verification of LTL properties on parallel systems" category, using an adaptation of this approach.

15 http://cadp.inria.fr/news12.html

The adequacy of $L_\mu^{dbr}$ with divbranching bisimilarity is shown in [37]. This paper also claims that ACTL\X is as expressive as $L_\mu^{dbr}$ and thus also adequate with divbranching bisimilarity, but a small mistake in the proof had the authors omit that the $L_\mu^{dbr}$ formula $\langle\tau\rangle@$ cannot actually be expressed in ACTL\X. It remains true that ACTL\X is preserved by divbranching bisimilarity.

In [13], it is shown that ACTL\X is adequate with divergence sensitive branching bisimilarity. This bisimilarity relation is equivalent to divbranching bisimilarity [21-23] only in the case of deadlock-free LTS, but it differs in the presence of deadlock states since it does not distinguish a deadlock state from a self $\tau$-loop (which can instead be recognized in $L_\mu^{dbr}$ with the $\langle\tau\rangle@$ formula).


8 Conclusion 


This work enhances the reductions that can be obtained by combining compo- 
sitional LTS construction with an analysis of the temporal logic formula to be 
verified. In particular, known results about strong and divbranching bisimilari- 
ties have been combined into a new family of relations called sharp bisimilarities, 
which inherit all nice properties of their ancestors and refine the state of the art 
in compositional verification. 

This new approach is promising. Yet, to be both usable by non-experts and fully efficient, at least two components are still missing: (1) The sets of strong actions, which are a key ingredient in the success of this approach, still have to be computed either using pencil and paper or using tools dedicated to restricted logics; automating their computation in the case of arbitrary $L_\mu$ formulas is not easy, but likely feasible, opening the way to a new research track; finding a minimal set of strong actions automatically is challenging, and since it is not unique, even more challenging is the quest for the set that will incur the best reductions. (2) Efficient algorithms are needed to minimize LTS for sharp bisimilarity; they could probably be obtained by adapting the known algorithms for strong and divbranching minimizations (at least using some kind of signature-based partition refinement algorithm in the style of Blom et al. [3-5] in a first step), but this remains to be done.


Abstract. Quantization converts neural networks into low-bit fixed-point computations which can be carried out by efficient integer-only hardware, and is standard practice for the deployment of neural networks on real-time embedded devices. However, like their real-numbered counterparts, quantized networks are not immune to malicious misclassification caused by adversarial attacks. We investigate how quantization affects a network's robustness to adversarial attacks, which is a formal verification question. We show that neither robustness nor non-robustness is monotonic in the number of bits of the representation and that neither is preserved by quantization from a real-numbered network. For this reason, we introduce a verification method for quantized neural networks which, using SMT solving over bit-vectors, accounts for their exact, bit-precise semantics. We built a tool and analyzed the effect of quantization on a classifier for the MNIST dataset. We demonstrate that, compared to our method, existing methods for the analysis of real-numbered networks often derive false conclusions about their quantizations, both when determining robustness and when detecting attacks, and that existing methods for quantized networks often miss attacks. Furthermore, we applied our method beyond robustness, showing how the number of bits in quantization enlarges the gender bias of a predictor for students' grades.


1 Introduction 


Deep neural networks are powerful machine learning models, and are becoming increasingly popular in software development. In recent years, they have pervaded our lives: think about the language recognition system of a voice assistant, the computer vision employed in face recognition or self-driving, not to mention the many decision-making tasks that are hidden under the hood. However, this also subjects them to the resource limits that real-time embedded devices impose. Mainly, the requirements are low energy consumption, as they often run on batteries, and low latency, both to maintain user engagement and to effectively interact with the physical world. This translates into specializing our computation by reducing the memory footprint and instruction set, to minimize cache misses and avoid costly hardware operations. For this purpose, quantization compresses neural networks, which are traditionally run over 32-bit floating-point arithmetic, into computations that require bit-wise and integer-only arithmetic over small words, e.g., 8 bits. Quantization is the standard technique for the deployment of neural networks on mobile and embedded devices, and is implemented in TensorFlow Lite [13]. In this work, we investigate the robustness of quantized networks to adversarial attacks and, more generally, formal verification questions for quantized neural networks.


© The Author(s) 2020


Adversarial attacks are a well-known vulnerability of neural networks [24]. 
For instance, a self-driving car can be tricked into confusing a stop sign with a 
speed limit sign [9], or a home automation system can be commanded to deac- 
tivate the security camera by a voice reciting poetry [22]. The attack is carried 
out by superposing the innocuous input with a crafted perturbation that is im- 
perceptible to humans. Formally, the attack lies within the neighborhood of a 
known-to-be-innocuous input, according to some notion of distance. The fraction 
of samples (from a large set of test inputs) that do not admit attacks determines 
the robustness of the network. We ask ourselves how quantization affects a net- 
work’s robustness or, dually, how many bits it takes to ensure robustness above 
some specific threshold. This amounts to proving that, for a set of given quantizations and inputs, there does not exist an attack, which is a formal verification question.
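
To make the notion of robustness above concrete, the following Python sketch computes the fraction of samples that admit no attack by exhaustively enumerating integer perturbations in an L-infinity ball. It is a toy illustration under stated assumptions: the function names, the choice of distance, the use of the classifier's own output on the sample as the reference class, and the two-feature classifier are all hypothetical, and exhaustive enumeration only works for tiny input spaces.

```python
from itertools import product


def admits_attack(classify, sample, radius):
    """Exhaustively search the L-infinity ball of integer perturbations."""
    expected = classify(sample)
    for delta in product(range(-radius, radius + 1), repeat=len(sample)):
        perturbed = tuple(x + d for x, d in zip(sample, delta))
        if classify(perturbed) != expected:
            return True  # the perturbed input is a witness of an attack
    return False


def robustness(classify, samples, radius):
    """Fraction of samples for which no attack exists within the radius."""
    safe = sum(not admits_attack(classify, s, radius) for s in samples)
    return safe / len(samples)


def toy_classifier(x):
    # Hypothetical two-feature classifier with a linear decision boundary.
    return int(x[0] + x[1] > 10)


samples = [(0, 0), (5, 5), (9, 9), (4, 6)]
```

Here `robustness(toy_classifier, samples, 1)` is 0.5: the samples `(5, 5)` and `(4, 6)` sit on the decision boundary and can be pushed across it by a perturbation of size 1, while the other two cannot.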


The formal verification of neural networks has been addressed either by overapproximating the space of outputs given a space of attacks, as happens in abstract interpretation, or by searching for a variable assignment that witnesses an attack, as happens in SMT-solving. The first category includes methods that relax the neural networks into computations over interval arithmetic [20], treat them as hybrid automata [27], or abstract them directly by using zonotopes, polyhedra [10], or tailored abstract domains [23]. Overapproximation-based methods are typically fast, but incomplete: they prove robustness but do not produce attacks. On the other hand, methods based on local gradient descent have turned out to be effective in producing attacks in many cases [16], but sacrifice formal completeness. Indeed, the search for adversarial attacks is NP-complete even for the simplest (i.e., ReLU) networks [14], which motivates the rise of methods based on Satisfiability Modulo Theories (SMT) and Mixed Integer Linear Programming (MILP). SMT-solvers have been shown not to scale beyond toy examples (20 hidden neurons) on monolithic encodings [21], but today's specialized techniques can handle real-life benchmarks such as neural networks for the MNIST dataset. Specialized tools include DLV [12], which subdivides the problem into smaller SMT instances, and Planet [8], which combines different SAT and LP relaxations. Reluplex takes a step further, augmenting LP-solving with a custom calculus for ReLU networks [14]. At the other end of the spectrum, a recent MILP formulation turned out to be effective using off-the-shelf solvers [25]. Moreover, it formed the basis for Sherlock [7], which couples local search and MILP, and for a specialized branch and bound algorithm [4].


None of the techniques mentioned above reasons about the machine-precise semantics of the networks, over either floating-point or fixed-point arithmetic; they reason about a real-number relaxation instead. Unfortunately, adversarial attacks computed over the reals are not necessarily attacks on execution architectures, in particular for quantized network implementations. We show, for the first time, that attacks and, more generally, robustness and vulnerability to attacks do not always transfer between real and quantized networks, and also do not always transfer monotonically with the number of bits across quantized networks. Verifying the real-valued relaxation of a network may lead to scenarios where


(i) specifications are fulfilled by the real-valued network but not by its quantized implementation (false negatives),
(ii) specifications are violated by the real-valued network but fulfilled by its quantized representation (false positives), or
(iii) counterexamples witness that the real-valued network violates the specification, but do not witness a violation for the quantized network (invalid counterexamples/attacks).


More generally, we show that all three phenomena can occur non-monotonically with the precision in the numerical representation. In other words, it may occur that a quantized network fulfills a specification while both a higher-bit and a lower-bit quantization violate it, or that the first violates it and both the higher-bit and lower-bit quantizations fulfill it; moreover, specific counterexamples may not transfer monotonically across quantizations.

The verification of real-numbered neural networks using the available meth- 
ods is inadequate for the analysis of their quantized implementations, and the 
analysis of quantized neural networks needs techniques that account for their 
bit-precise semantics. Recently, a similar problem has been addressed for bina- 
rized neural networks, through SAT-solving [18]. Binarized networks represent 
the special case of 1-bit quantizations. For many-bit quantizations, a method 
based on gradient descent has been introduced recently [28]. While efficient (and 
sound), this method is incomplete and may produce false negatives. 

We introduce, for the first time, a complete method for the formal verification of quantized neural networks. Our method accounts for the bit-precise semantics of quantized networks by leveraging the first-order theory of bit vectors without quantifiers (QF_BV), to exactly encode hardware operations such as two's complementation, bit-shift, and integer arithmetic with overflow. On the technical side, we present a novel encoding which balances the layout of long sequences of hardware multiply-add operations occurring in quantized neural networks. As a result, we obtain an encoding into a first-order logic formula which, in contrast to a standard unbalanced linear encoding, makes the verification of quantized networks practical and amenable to modern bit-precise SMT-solving. We built a tool using Boolector [19], evaluated the performance of our encoding, compared its effectiveness against real-numbered verification and gradient descent for quantized networks, and finally assessed the effect of quantization for different networks and verification questions.
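
The difference between a linear and a balanced layout of a long sum can be illustrated independently of any solver. The Python sketch below is only an assumption-laden illustration of the general idea of balancing: it compares the nesting depth of a left-to-right chain of additions with that of a pairwise reduction tree over the same terms. The function names are hypothetical, and the actual SMT encoding is not described in this section.

```python
def linear_sum(terms):
    """Left-to-right chain ((t0 + t1) + t2) + ...; depth grows linearly."""
    total, depth = terms[0], 0
    for t in terms[1:]:
        total, depth = total + t, depth + 1
    return total, depth


def balanced_sum(terms):
    """Pairwise reduction tree; depth grows logarithmically."""
    if len(terms) == 1:
        return terms[0], 0
    mid = len(terms) // 2
    left, left_depth = balanced_sum(terms[:mid])
    right, right_depth = balanced_sum(terms[mid:])
    return left + right, 1 + max(left_depth, right_depth)


products = list(range(1, 9))  # stands in for eight weight-input products
```

Both layouts return the same sum (36 for `products`), but the chain nests 7 additions whereas the balanced tree nests only 3.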

We measured the robustness to attacks of a neural classifier involving 890 neurons and trained on the MNIST dataset (handwritten digits), for quantizations between 6 and 10 bits. First, we demonstrated that Boolector, off-the-shelf and using our balanced SMT encoding, can compute every attack within 16 hours, with a median time of 3h 41m, while it timed out on all instances beyond 6 bits using a standard linear encoding. Second, we experimentally confirmed that both Reluplex and gradient descent for quantized networks can produce false conclusions about quantized networks; in particular, spurious results occurred consistently more frequently as the number of bits in quantization decreases. Finally, we discovered that, to achieve an acceptable level of robustness, it takes a higher bit quantization than is assessed by standard accuracy measures.

Lastly, we applied our method beyond the property of robustness. We also evaluated the effect of quantization upon the gender bias emerging from quantized predictors for students' performance in mathematics exams. More precisely, we computed the maximum predictable grade gap between any two students with identical features except for gender. The experiment showed that a substantial gap existed and was proportionally enlarged by quantization: the lower the number of bits, the larger the gap.

We summarize our contributions in five points. First, we show that the robustness of quantized neural networks is non-monotonic in the number of bits
and is non-transferable from the robustness of their real-numbered counterparts. 
Second, we introduce the first complete method for the verification of quan- 
tized neural networks. Third, we demonstrate that our encoding, in contrast to 
standard encodings, enabled the state-of-the-art SMT-solver Boolector to verify 
quantized networks with hundreds of neurons. Fourth, we also show that exist- 
ing methods determine both robustness and vulnerability of quantized networks 
less accurately than our bit-precise approach, in particular for low-bit quanti- 
zations. Fifth, we illustrate how quantization affects the robustness of neural 
networks, not only with respect to adversarial attacks, but also with respect to 
other verification questions, specifically fairness in machine learning. 


2 Quantization of Feed-forward Networks 


A feed-forward neural network consists of a finite set of neurons $x_1, \dots, x_k$ partitioned into a sequence of layers: an input layer with $n$ neurons, followed by one or many hidden layers, finally followed by an output layer with $m$ neurons. Every pair of neurons $x_j$ and $x_i$ in respectively subsequent layers is associated with a weight coefficient $w_{ij} \in \mathbb{R}$; if the layer of $x_j$ is not subsequent to that of $x_i$, then we assume $w_{ij} = 0$. Every hidden or output neuron $x_i$ is associated with a bias coefficient $b_i \in \mathbb{R}$. The real-valued semantics of the neural network gives to each neuron a real value: upon a valuation for the neurons in the input layer, every other neuron $x_i$ assumes its value according to the update rule


$$x_i = \text{ReLU-}N\Big(b_i + \sum_{j=1}^{k} w_{ij} x_j\Big), \qquad (1)$$


where $\text{ReLU-}N : \mathbb{R} \to \mathbb{R}$ is the activation function. Altogether, the neural network implements a function $f : \mathbb{R}^n \to \mathbb{R}^m$ whose result corresponds to the valuation for the neurons in the output layer.
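
The real-valued update rule of Eq. 1 can be sketched for one layer as follows. This is a minimal Python illustration in which the function names and the list-of-rows weight layout are assumptions; it uses the capped activation that takes the positive part of its argument and bounds it by $N$.

```python
def relu_n(x, cap):
    """Capped ReLU: positive part of x, bounded above by cap."""
    return min(max(x, 0.0), cap)


def layer(inputs, weights, biases, cap):
    """Apply the update rule of Eq. 1 to every neuron of one layer.

    weights[i][j] is the coefficient w_ij from input neuron j to neuron i.
    """
    return [relu_n(b + sum(w * x for w, x in zip(row, inputs)), cap)
            for row, b in zip(weights, biases)]


outputs = layer([1.0, 2.0], [[0.5, -1.0], [2.0, 3.0]], [0.25, 0.0], 6.0)
```

In this example the first neuron receives 0.25 + 0.5 - 2.0 = -1.25 and outputs 0.0, while the second receives 8.0 and is capped at 6.0.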


How Many Bits Does it Take to Quantize Your Neural Network? 83 


The activation function governs the firing logic of the neurons, layer by layer, by introducing non-linearity in the system. Among the most popular activation functions are purely non-linear functions, such as the hyperbolic tangent and the sigmoidal function, and piece-wise linear functions, better known as Rectified Linear Units (ReLU) [17]. ReLU consists of the function that takes the positive part of its argument, i.e., $\text{ReLU}(x) = \max\{x, 0\}$. We consider the variant of ReLU that imposes a cap value $N$, known as ReLU-N [15]. Precisely


$$\text{ReLU-}N(x) = \min\{\max\{x, 0\}, N\}, \qquad (2)$$


which can be alternatively seen as a concatenation of two ReLU functions (see Eq. 10). As a consequence, all neural networks we treat are full-fledged ReLU networks; their real-valued versions are amenable to state-of-the-art verification tools including Reluplex, but these tools account for neither the exact floating-point nor the exact fixed-point execution model.

Quantizing consists of converting a neural network over real numbers, which is normally deployed on floating-point architectures, into a neural network over integers, whose semantics corresponds to a computation over fixed-point arithmetic [13]. Specifically, fixed-point arithmetic can be carried out by integer-only architectures and possibly over small words, e.g., 8 bits. All numbers are represented in 2's complement over $B$-bit words and $F$ bits are reserved for the fractional part: we call the result a $B$-bits quantization in QF arithmetic. More concretely, the conversion follows from the rounding of weight and bias coefficients to the $F$-th digit, namely $\bar{b}_i = \mathit{rnd}(2^F b_i)$ and $\bar{w}_{ij} = \mathit{rnd}(2^F w_{ij})$ where $\mathit{rnd}(\cdot)$ stands for any rounding to an integer. Then, the fundamental relation between a quantized value $\bar{a}$ and its real counterpart $a$ is


$$a \approx 2^{-F}\bar{a}. \qquad (3)$$
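
The relation in Eq. 3 can be checked on a small example. The Python sketch below uses the built-in `round` as one admissible choice of rounding, which is an assumption since the text allows any rounding to an integer; the function names are hypothetical.

```python
def quantize(value, frac_bits):
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def dequantize(q, frac_bits):
    """Real value represented by the integer q."""
    return q / (1 << frac_bits)


F = 4                      # number of fractional bits
q = quantize(0.3, F)       # 0.3 * 16 = 4.8, rounded to 5
approx = dequantize(q, F)  # 5 / 16 = 0.3125
```

With 4 fractional bits, 0.3 is stored as the integer 5 and read back as 0.3125; with nearest rounding, the error stays within half of the quantization step, here 1/32.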


Consequently, the semantics of a quantized neural network corresponds to the 
update rule in Eq. 1 after substituting of z, w, and b with the respective approx- 
imants 2-Fz, 27" w, and 2-"b. Namely, the semantics amounts to 


k 
z; = ReLU{2" N) (b; + int(27F V ^ wi;2;)), (4) 


j=l 


where int(-) truncates the fractional part of its argument or, in other words, 
rounds towards zero. In summary, the update rule for the quantized semantics 
consists of four parts. The first part, ie., the linear combination Si UT, 
propagates all neurons values from the previous layer, obtaining a value with 
possibly 2B fractional bits. The second scales the result by 2-7 truncating the 
fractional part by, in practice, applying an arithmetic shift to the right of F bits. 
Finally, the third applies the bias b and the fourth clamps the result between 0 
and 27 N. As a result, a quantize neural network realizes a function f: Z" > Z™, 
which exactly represents the concrete (integer-only) hardware execution. 
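
As an illustration of the four parts of the update rule in Eq. 4, the following minimal sketch evaluates one quantized layer on plain integers. The function names and the example weights are hypothetical and chosen only for illustration; the sketch rounds towards zero as int(·) is defined above, and handles negative values explicitly because Python's `>>` rounds towards minus infinity.

```python
from typing import List


def trunc_shift(value: int, frac_bits: int) -> int:
    """Scale by 2**-frac_bits, rounding towards zero as int(.) does in Eq. 4."""
    if value >= 0:
        return value >> frac_bits
    # Python's >> floors negative numbers, so shift the magnitude instead.
    return -((-value) >> frac_bits)


def relu_n(value: int, cap: int) -> int:
    """Clamp value between 0 and cap."""
    return min(max(value, 0), cap)


def quantized_layer(x: List[int], weights: List[List[int]], biases: List[int],
                    frac_bits: int, n_cap: int) -> List[int]:
    """One layer of Eq. 4; all arguments are integers in QF representation."""
    cap = n_cap << frac_bits  # the clamp bound 2^F * N
    out = []
    for w_row, b in zip(weights, biases):
        acc = sum(w * xj for w, xj in zip(w_row, x))   # part 1: linear combination
        scaled = trunc_shift(acc, frac_bits)           # part 2: scale and truncate
        out.append(relu_n(b + scaled, cap))            # parts 3 and 4: bias, clamp
    return out


# Q2 example (F = 2, N = 7): the inputs 1.5 and 1.0 are stored as 6 and 4.
print(quantized_layer([6, 4], [[3, 1], [-5, 1]], [1, 0], frac_bits=2, n_cap=7))
```

In this example the first neuron accumulates 22, which is truncated to 5 and biased to 6 (i.e., 1.5 in Q2), while the second accumulates -26, which is truncated to -6 and clamped to 0.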

We assume all intermediate values, e.g., of the linear combination, to be 
fully representable as, coherently with the common execution platforms [13], we 
always allocate enough bits for under and overflow not to happen. Hence, any
loss of precision from the respective real-numbered network happens exclusively, 
at each layer, as a consequence of rounding the result of the linear combination to 
F fractional bits. Notably, rounding causes the robustness to adversarial attacks 
of quantized networks with different quantization levels to be independent of one 
another, and independent of their real counterpart. 


3 Robustness is Non-monotonic in the Number of Bits 


A neural classifier is a neural network that maps an n-dimensional input to one
out of m classes, each of which is identified by the output neuron with the largest
value, i.e., for the output values $z_1, \dots, z_m$, the choice is given by


$$\mathrm{class}(z_1, \dots, z_m) = \arg\max_i z_i. \tag{5}$$

For example, a classifier for handwritten digits takes as input the pixels of an
image and returns 10 outputs $z_0, \dots, z_9$, where the largest indicates the digit the
image represents. An adversarial attack is a perturbation for a sample input 


original + perturbation = attack 


that, according to some notion of closeness, is indistinguishable from the original,
but tricks the classifier into inferring an incorrect class. The attack in Fig. 1 is
indistinguishable from the original by the human eye, but induces our classifier
to assign the largest value to $z_3$, rather than $z_9$, misclassifying the digit as a
3. For this example, misclassification happens consistently, both on the
real-numbered and on the respective 8-bit quantized network in Q4 arithmetic.
Unfortunately, attacks do not necessarily transfer between real and quantized
networks, nor between quantized networks of different precision. More
generally, attacks and, dually, robustness to attacks are non-monotonic in the
number of bits.


Fig. 1: Adversarial attack.

We give a prototypical example for the non-monotonicity of quantized networks
in Fig. 2. The network consists of one input, 4 hidden, and 2 output
neurons, respectively from left to right. Weights and bias coefficients, which are
annotated on the edges, are all fully representable in Q1. For the neurons in the
top row we show, respectively from top to bottom, the valuations obtained using
a Q3, Q2, and Q1 quantization of the network (following Eq. 4); precisely, we
show their fractional counterpart $\bar{x}/2^F$. We evaluate all quantizations and obtain
that the valuations for the top output neuron are non-monotonic in the number
of fractional bits; in fact, the Q1 output dominates the Q3 output, which
dominates the Q2 output. Coincidentally, the valuations for the Q3 quantization
correspond to the valuations with real-number precision (i.e., never undergo
truncation), indicating that also real and quantized networks are similarly
incomparable. Notably, all phenomena occur both for quantized networks with
rounding towards zero (as we show in the example), and with rounding to the
nearest, which is naturally non-monotonic (e.g., 5/16 rounds to 1/2, 1/4, and 3/8
with, resp., Q1, Q2, and Q3).


Fig. 2: Neural network with non-monotonic robustness w.r.t. its Q1, Q2, and Q3
quantizations.
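
The rounding example for 5/16 can be reproduced with exact rational arithmetic. The sketch below is illustrative only: the helper name is hypothetical, and it assumes that ties are rounded upwards, which is what yields 3/8 in Q3 (5/16 lies exactly halfway between 1/4 and 3/8).

```python
import math
from fractions import Fraction


def round_nearest(value: Fraction, frac_bits: int) -> Fraction:
    """Round value to the nearest multiple of 2**-frac_bits (ties round up)."""
    scaled = value * 2 ** frac_bits
    return Fraction(math.floor(scaled + Fraction(1, 2)), 2 ** frac_bits)


value = Fraction(5, 16)
for frac_bits in (1, 2, 3):
    print(f"Q{frac_bits}: {round_nearest(value, frac_bits)}")
```

The rounded value first drops from 1/2 to 1/4 and then rises to 3/8, so it is not monotonic in the number of fractional bits.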


Non-monotonicity of the output causes non-monotonicity of robustness, as
we can place the decision boundary of the classifier so as to put Q2 into a different
class than Q1 and Q3. Suppose the original sample is 3/2 and its class is
associated with the output neuron on the top, and suppose attacks can only lie in
the neighboring interval 3/2 ± 1. In this case, we obtain that the Q2 network admits
an attack, because the bottom output neuron can take 5/2, which is larger than
2. On the other hand, the bottom output can never exceed 3/8 and 1/2, hence
Q1 and Q3 are robust. Dually, also non-robustness is non-monotonic as, for the
sample 9/2 whose class corresponds to the bottom neuron, for the interval 9/2
± 2, Q2 is robust while both Q3 and Q1 are vulnerable. Notably, the specific
attacks of Q3 and Q1 also do not always coincide as, for instance, for 7/2.


Robustness and non-robustness are non-monotonic in the number of bits
for quantized networks. As a consequence, verifying a high-bit quantization,
or a real-valued network, may derive false conclusions about a target lower-bit
quantization, in either direction. Specifically, for the question of whether an
attack exists, we may have both (i) false negatives, i.e., the verified network is
robust but the target network admits an attack, and (ii) false positives, i.e., the
verified network is vulnerable while the target network is robust. In addition, we
may also have (iii) true positives with invalid attacks, i.e., both are vulnerable
but the found attack does not transfer to the target network. For these reasons
we introduce a verification method for quantized neural networks that accounts
for their bit-precise semantics.
4 Verification of Quantized Networks using Bit-precise
SMT-solving


Bit-precise SMT-solving comprises various technologies for deciding the satisfia- 
bility of first-order logic formulae, whose variables are interpreted as bit-vectors 
of fixed size. In particular, it produces satisfying assignments (if any exist) for 
formulae that include bitwise and arithmetic operators, whose semantics corre- 
sponds to that of hardware architectures. For instance, we can encode bit-shifts, 
2's complementation, multiplication and addition with overflow, signed and un- 
signed comparisons. More precisely, this is the quantifier-free first-order theory 
of bit-vectors (i.e., QF_BV), which we employ to produce a monolithic encoding
of the verification problem for quantized neural networks. 


A verification problem for the neural networks $f_1, \dots, f_K$ consists of checking
the validity of a statement of the form

$$\varphi(y_1, \dots, y_K) \implies \psi(f_1(y_1), \dots, f_K(y_K)), \tag{6}$$

where $\varphi$ is a predicate over the inputs and $\psi$ over the outputs of all networks; in
other words, it consists of checking an input-output relation, which generalizes
various verification questions, including robustness to adversarial attacks and
fairness in machine learning, which we treat in Sec. 5. For the purpose of SMT
solving, we encode the verification problem in Eq. 6, which is a validity question,
by its dual satisfiability question


$$\varphi(y_1, \dots, y_K) \wedge \bigwedge_{i=1}^{K} f_i(y_i) = z_i \wedge \neg\psi(z_1, \dots, z_K), \tag{7}$$


whose satisfying assignments constitute counterexamples for the contract. The
formula consists of three conjuncts: the leftmost constrains the input within
the assumption, the rightmost forces the output to violate the guarantee, while
the one in the middle relates inputs and outputs by the semantics of the neural
networks.

The semantics of the network consists of the bit-level translation of the
update rule in Eq. 4 over all neurons, which we encode in the formula


$$\bigwedge_{i=1}^{k} x_i = \text{ReLU-}(2^F N)(x'_i) \wedge x'_i = \bar{b}_i + \mathrm{ashr}(x''_i, F) \wedge x''_i = \sum_{j=1}^{k} \bar{w}_{ij} x_j. \tag{8}$$


Each conjunct in the formula employs three variables $x$, $x'$, and $x''$ and is made
of three respective parts. The first part accounts for the operation of clamping
between 0 and $2^F N$, whose semantics is given by the formula
$\text{ReLU-}M(x) = \mathrm{ite}(\mathrm{sign}(x), 0, \mathrm{ite}(x \ge M, M, x))$. Then, the second part accounts
for the operations of scaling and biasing. In particular, it encodes the operation
of scaling with rounding by truncation, i.e., $\mathrm{int}(2^{-F} x'')$, as an arithmetic shift
to the right. Finally, the last part accounts for the propagation of values from
the previous layer, which, despite the obvious optimization of pruning away all
monomials with null coefficient, often consists of long linear combinations, whose
exact semantics amounts to a sequence of multiply-add operations over an
accumulator; particularly, encoding it requires care in choosing variable size and
association layout.


Fig. 3: Abstract syntax trees for alternative encodings of a long linear combination of
the form $\sum_{i=1}^{k} w_i x_i$.
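
To make the operators of Eq. 8 concrete, the following sketch emulates them on fixed-width two's complement words using Python integers. It is only an emulation under stated assumptions: the helper names and the 16-bit width are hypothetical, and an actual encoding would state the same operations as QF_BV terms for the solver.

```python
def to_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits as a two's complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def ashr(bits: int, shift: int, width: int) -> int:
    """Arithmetic right shift on a `width`-bit vector (the sign bit is replicated)."""
    return (to_signed(bits, width) >> shift) & ((1 << width) - 1)


def relu_m(bits: int, cap: int, width: int) -> int:
    """ite(sign(x), 0, ite(x >= M, M, x)) on a `width`-bit vector."""
    sign = (bits >> (width - 1)) & 1
    if sign:
        return 0
    return cap if bits >= cap else bits


def neuron(acc_bits: int, bias: int, frac_bits: int, cap: int, width: int) -> int:
    """Second and first conjunct of Eq. 8 applied to an accumulator value x''."""
    mask = (1 << width) - 1
    shifted = ashr(acc_bits & mask, frac_bits, width)  # scaling by 2^-F
    biased = (shifted + bias) & mask                   # x' = b + ashr(x'', F)
    return relu_m(biased, cap, width)                  # x = ReLU-(2^F N)(x')


print(neuron(22, bias=1, frac_bits=2, cap=28, width=16))   # positive accumulator
print(neuron(-26, bias=0, frac_bits=2, cap=28, width=16))  # negative accumulator
print(neuron(224, bias=0, frac_bits=2, cap=28, width=16))  # saturates at the cap
```

The sign test reads only the most significant bit, so the clamp needs no signed comparison for negative inputs.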

The size of the bit-vector variables determines whether overflows can occur. 
In particular, since every monomial $\bar{w}_{ij} x_j$ consists of the multiplication of two
$B$-bit variables, its result requires $2B$ bits in the worst case; since summation
increases the value linearly, its result requires a logarithmic amount of extra
bits in the number of summands (regardless of the layout). Provided that, we
avoid overflow by using variables of $2B + \log k$ bits, where $k$ is the number of
summands. 
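
As a worked instance of this bound, the sketch below computes the accumulator width for a given word size and number of summands. The function name is hypothetical, and rounding the logarithm up to the next integer is our assumption, since a bit count must be integral.

```python
def accumulator_bits(word_bits: int, num_summands: int) -> int:
    """Width 2B + ceil(log2 k) for a sum of k products of B-bit values."""
    extra = (num_summands - 1).bit_length()  # equals ceil(log2 k) for k >= 1
    return 2 * word_bits + extra


# First hidden layer of a 784-input network with 8-bit words.
bits = accumulator_bits(8, 784)
print(bits)

# Sanity check: the largest possible magnitude of the sum fits in a signed word.
worst_case = 784 * (2 ** 7) ** 2
print(worst_case < 2 ** (bits - 1))
```

For 8-bit words and 784 summands this gives 16 bits for each product plus 10 extra bits for the summation.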

The association layout is not unique and, more precisely, varies with the or- 
der of construction of the long summation. For instance, associating from left 
to right produces a linear layout, as in Fig. 3a. Long linear combinations
occurring in quantized neural networks are implemented as sequences of multiply-add
operations over a single accumulator; this naturally induces a linear encoding.
Instead, for the purpose of formal verification, we propose a novel encoding which
re-associates the linear combination by recursively splitting the sum into equal 
parts, producing a balanced layout as in Fig. 3b. While linear and balanced lay- 
outs are semantically equivalent, we have observed that, in practice, the second 
impacted positively the performance of the SMT-solver as we discuss in Sec. 5, 
where we also compare against other methods and investigate different verifica- 
tion questions. 
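
The difference between the two layouts can be sketched by building both syntax trees for the same list of monomials. This is an illustrative toy: the tuple-based tree representation and the function names are our own, not the encoding used in the experiments, and the monomials are opaque leaves.

```python
def linear_sum(terms):
    """Associate from left to right: ((t0 + t1) + t2) + ..."""
    tree = terms[0]
    for term in terms[1:]:
        tree = ("+", tree, term)
    return tree


def balanced_sum(terms):
    """Recursively split the summation into two halves of (almost) equal size."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return ("+", balanced_sum(terms[:mid]), balanced_sum(terms[mid:]))


def depth(tree):
    """Number of nested additions on the longest path of the tree."""
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 0


def evaluate(tree, env):
    """Evaluate a tree given a value for every leaf."""
    if isinstance(tree, tuple):
        return evaluate(tree[1], env) + evaluate(tree[2], env)
    return env[tree]


terms = [f"w{i}*x{i}" for i in range(8)]
env = {term: i + 1 for i, term in enumerate(terms)}
print(depth(linear_sum(terms)), depth(balanced_sum(terms)))
print(evaluate(linear_sum(terms), env) == evaluate(balanced_sum(terms), env))
```

Both trees denote the same sum, but the linear tree nests k - 1 additions while the balanced tree nests only about log k of them.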


5 Experimental Results 


We set up an experimental evaluation benchmark based on the MNIST dataset 
to answer the following three questions. First, how does our balanced encoding 
scheme impact the runtime of different SMT solvers compared to a standard
linear encoding? Then, how often can robustness properties that are proven for
the real-valued network be transferred to the quantized network and vice versa?
Finally, how often do gradient based attacking procedures miss attacks for quan- 
tized networks? 

The MNIST dataset is a well-studied computer vision benchmark, which 
consists of 70,000 handwritten digits represented by 28-by-28 pixel images with 
a single 8-bit grayscale channel. Each sample belongs to exactly one category 
{0,1,...9}, which a machine learning model must predict from the raw pixel 
values. The MNIST set is split into 60,000 training and 10,000 test samples. 

We trained a neural network classifier on MNIST, following a post-training 
quantization scheme [13]. First, we trained, using TensorFlow with floating-point
precision, a network composed of 784 inputs, 2 hidden layers of size 64, 32 with 
ReLU-7 activation function and 10 outputs, for a total of 890 neurons. The 
classifier yielded a standard accuracy, i.e., the ratio of samples that are correctly 
classified out of all samples in the testing set, of 94.7% on the floating-point 
architecture. Afterward, we quantized the network with various bit sizes, with 
the exception of imposing the input layer to be always quantized in 8 bits, i.e., 
the original precision of the samples. The quantized networks required at least 
Q3 with 7 total bits to obtain an accuracy above 90% and Q5 with 10 bits to 
reach 94%. For this reason, we focused our study on the quantizations from 6
to 10 bits in, respectively, Q2 to Q6 arithmetic.

Robust accuracy or, more simply, robustness measures the ratio of robust
samples: for the distance $\varepsilon > 0$, a sample $a$ is robust when, for all its
perturbations $y$ within that distance, the classifier $\mathrm{class} \circ f$ chooses the original
class $c = \mathrm{class} \circ f(a)$. In other words, $a$ is robust if, for all $y$,


$$|a - y|_\infty \le \varepsilon \implies c = \mathrm{class} \circ f(y), \tag{9}$$


where, in particular, the right-hand side can be encoded as
$\bigwedge_{j=1}^{m} z_j \le z_c$ for $z = f(y)$. Robustness is a validity question as in Eq. 6 and
any witness for the dual satisfiability question constitutes an adversarial attack.
We checked the robustness of our selected networks over the first 300 test
samples from the dataset with $\varepsilon = 1$ on the first 200 and $\varepsilon = 2$ on the next 100;
in particular, we tested our encoding using the SMT-solvers Boolector [19],
Z3 [5], and CVC4 [3],
off-the-shelf.
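
The robustness condition of Eq. 9 and its dual can be illustrated on a toy integer classifier by exhaustive enumeration. The sketch below is not the SMT encoding: brute-force search stands in for the solver, the two-output network is hypothetical, and ties in the arg max are resolved towards the lowest index, which is our assumption.

```python
from itertools import product


def classify(outputs):
    """arg max over the outputs (the lowest index wins on ties)."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def find_attack(network, sample, eps, lo=0, hi=255):
    """Return an input within L-infinity distance eps that changes the class, or None."""
    original = classify(network(sample))
    ranges = [range(max(lo, v - eps), min(hi, v + eps) + 1) for v in sample]
    for candidate in product(*ranges):
        if classify(network(list(candidate))) != original:
            return list(candidate)  # witness of the dual satisfiability question
    return None                     # no witness: the sample is robust


def toy_network(x):
    """Hypothetical integer network with two inputs and two outputs."""
    return [x[0] + x[1], 2 * x[1]]


print(find_attack(toy_network, [5, 3], eps=1))
print(find_attack(toy_network, [5, 3], eps=2))
```

With a distance of 1 the toy sample has no attack, whereas with a distance of 2 the search returns a perturbed input that flips the class.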

Our experiments serve two purposes. The first is evaluating the scalability 
and precision of our approach. As for scalability, we study how encoding layout, 
i.e., linear or balanced, and the number of bits affect the runtime of the SMT- 
solver. As for precision, we measured the gap between our method and both a 
formal verifier for real-numbered networks, i.e., Reluplex [14], and the IFGSM 
algorithm [28], with respect to the accuracy of identifying robust and vulner- 
able samples. The second purpose of our experiments is evaluating the effect 
of quantization on the robustness to attacks of our MNIST classifier and, with 
an additional experiment, measuring the effect of quantization over the gender 
fairness of a student grades predictor, also demonstrating the expressiveness of 
our method beyond adversarial attacks. 
As we only compared the verification outcomes, any complete verifier for
real-numbered networks would lead to the same results as those obtained with 
Reluplex. Note that these tools verify the real-numbered abstraction of the net- 
work using some form of linear real arithmetic reasoning. Consequently, rounding 
errors introduced by the floating-point implementation of both, the network and 
the verifier, are not taken into account. 


5.1 Scalability and performance 


We evaluated whether our balanced encoding strategy, compared to a standard
linear encoding, can improve the scalability of contemporary SMT solvers for
quantifier-free bit-vectors (QF_BV) to check specifications of quantized neural
networks. We ran all our experiments on an Intel Xeon W-2175 CPU, with 64GB
memory, 128GB swap file, and 16 hours of time budget per problem instance.
We encoded each instance using the two variants, the standard linear and our
balanced layout. We scheduled 14 solver instances in parallel, i.e., the number of
physical processor cores available on our machine. While Z3, CVC4 and Yices2
timed out or ran out of memory on the 6-bit network, Boolector could check the
instances of our smallest network within the given time budget, independently
of the employed encoding scheme. Our results align with the SMT-solver
performances reported by the SMT-COMP 2019 competition in the QF_BV division
[11]. Consequently, we will focus our discussion on the results obtained with
Boolector.


SMT-solver       Encoding            6-bit    7-bit    8-bit    9-bit    10-bit

Boolector [19]   Linear (standard)   3h 25m   oot      oot      oot      oot
                 Balanced (ours)     18m      1h 29m   3h 41m   5h 34m   8h 58m
Z3 [5]           Linear (standard)   oot      -        -        -        -
                 Balanced (ours)     oot      -        -        -        -
CVC4 [3]         Linear (standard)   oom      -        -        -        -
                 Balanced (ours)     oom      -        -        -        -
Yices2 [6]       Linear (standard)   oot      -        -        -        -
                 Balanced (ours)     oot      -        -        -        -


Table 1: Median runtimes for bit-exact robustness checks. The term oot refers to
timeouts, and oom refers to out-of-memory errors. Due to the poor performance of Z3,
CVC4, and Yices2 on our smallest 6-bit network, we abstained from running
experiments involving more than 6 bits, i.e., entries marked by a dash (-).

With the linear layout Boolector timed out on all instances but the smallest
networks (6 bits), while with the balanced layout it checked all instances with
an overall median runtime of 3h 41m and, as shown in Tab. 1, a runtime roughly
doubling with every additional bit, as also confirmed by the histogram in Fig. 4.


Fig. 4: Runtimes for bit-exact adversarial robustness checks of a classifier trained on
the MNIST dataset using Boolector and our balanced SMT encodings. Runtime roughly
doubles with each additional bit used for the quantization.


Our results demonstrate that our balanced association layout improves the 
performance of the SMT-solver, enabling it to scale to networks beyond 6 bits. 
Conversely, a standard linear encoding turned out to be ineffective on all tested 
SMT solvers. Besides, our method tackled networks with 890 neurons which, 
while small compared to state-of-the-art image classification models, already 
pose challenging benchmarks for the formal verification task. In the real-numbered 
world, for instance, off-the-shelf solvers could initially tackle up to 20 neurons 
[20], and modern techniques, while faster, are often evaluated on networks below 
1000 neurons [14,4]. 

Additionally, we pushed our method to its limits, refining our MNIST network
to a four-layer deep convolutional network (2 Conv + 2 Fully-connected
layers) with a total of 2238 neurons, which achieved a test accuracy of 98.56%.
While for the 6-bit quantization we proved robustness for 99% of the tested
samples within a median runtime of 3h 39min, for 7 bits and above all instances
timed out. Notably, Reluplex also failed on the real-numbered version, reporting
numerical instability. 


5.2 Comparison to other methods 


Looking at existing methods for verification, one has two options to verify
quantized neural networks: verifying the real-valued network and hoping the
functional property is preserved when quantizing the network, or relying on
incomplete methods and hoping no counterexample is missed. A question that
emerges is how accurate these two approaches are for verifying robustness of a
quantized network. To answer this question, we used Reluplex [14] to prove the
robustness of the real-valued network. Additionally, we compared to the Iterative
Fast Gradient Sign Method (IFGSM), which has recently been proposed to generate
$L_\infty$-bounded adversarial attacks for quantized networks [28]; notably, IFGSM is
incomplete in the sense that it may miss attacks. We then compared these two
verification outcomes to the ground-truth obtained by our approach.

In our study, we employ the following notation. We use the term "false
negative" (i) to describe cases in which the quantized network can be attacked, while
no attack exists that fools the real-number network. Conversely, the term "false
positive" (ii) describes the cases in which a real-number attack exists while the
quantized network is robust. Furthermore, we use the term "invalid attack" (iii)
to specify attacks produced for the real-valued network that fool the real-valued
network but not the quantized network.

Regarding the real-numbered encoding, Reluplex accepts only pure ReLU 
networks. For this reason, we translate our ReLU-N networks into functionally 
equivalent ReLU networks, by translating each layer with 


$$\text{ReLU-}N(W \cdot x + b) = \mathrm{ReLU}(-I \cdot \mathrm{ReLU}(-W \cdot x - b + N) + N). \tag{10}$$
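
The translation of a capped activation into two plain ReLUs can be checked numerically for a scalar pre-activation y = W · x + b, using the form ReLU(N - ReLU(N - y)). The sketch below is a sanity check on integers only; the function names are our own.

```python
def relu(value):
    """Plain ReLU."""
    return max(value, 0)


def relu_n(value, cap):
    """Capped ReLU as in Eq. 2."""
    return min(max(value, 0), cap)


def relu_n_via_relu(value, cap):
    """Two stacked ReLUs: the inner one computes N - y, the outer one N - inner."""
    return relu(-relu(-value + cap) + cap)


cap = 7
print(all(relu_n(y, cap) == relu_n_via_relu(y, cap) for y in range(-20, 21)))
```

The inner ReLU vanishes exactly when y reaches the cap, and the outer ReLU vanishes exactly when y is negative, which reproduces both bounds of the clamp.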


Out of the 300 samples, at least one method timed out on 56 samples, leaving
us with 244 samples whose results were computed over all networks. Tab. 2
depicts how frequently the robustness property could be transferred from the
real-valued network to the quantized networks. Not surprisingly, we observed
the trend that when increasing the precision of the network, the error between
the quantized model and the real-valued model decreases. However, even for the
10-bit model, in 0.8% of the tested samples, verifying the real-valued model leads
to a wrong conclusion about the robustness of the quantized network. Moreover,
our results show the existence of samples where the 10-bit network is robust
while the real-valued is attackable and vice versa. The invalid attacks illustrate
that the higher the precision of the quantization, the more targeted attacks need
to be. For instance, while 94% of attacks generated for the real-valued network
represented valid attacks on the 7-bit model, this percentage decreases to 80%
for the 10-bit network.


        True        False       False       True        Invalid
Bits    negatives   negatives   positives   positives   attacks
                    (i)         (ii)                    (iii)

6       66.4%       25.0%       3.3%        5.3%        8%
7       84.8%       6.6%        1.6%        7.0%        6%
8       88.5%       2.9%        0.4%        8.2%        10%
9       91.0%       0.4%        0.4%        8.2%        20%
10      91.0%       0.4%        0.4%        8.2%        20%


Table 2: Transferability of vulnerability from the verification outcome of the
real-valued network to the verification outcome of the quantized model. While
vulnerability is transferable between the real-valued and the higher-precision networks
(9 and 10 bits) in most of the tested cases, the discrepancy significantly increases when
compressing the networks with fewer bits; see columns (i) and (ii).
Next, we compared how well incomplete methods are suited to reason about
the robustness of quantized neural networks. We employed IFGSM to attack the
244 test samples for which we obtained the ground-truth robustness and
measured how often IFGSM is correct about assessing the robustness of the network.
For the sake of completeness, we performed the same analysis for the real-valued
network.


        True        False       False       True
Bits    negatives   negatives   positives   positives
                    (i)         (ii)

6       69.7%       1.2%        -           30.3%
7       86.5%       1.6%        -           13.5%
8       88.9%       0.8%        -           11.1%
9       91.4%       0.8%        -           8.6%
10      91.4%       0%          -           8.6%
$\mathbb{R}$       91.4%       0%          -           8.6%


Table 3: Transferability of incomplete robustness verification (IFGSM [28]) to
ground-truth robustness (ours) for quantized networks. While for the real-valued and
10-bit networks our gradient based incomplete verification did not miss any possible
attack, a non-trivial number of vulnerabilities were missed by IFGSM for the low-bit
networks. The row indicated by $\mathbb{R}$ compares IFGSM attacking the floating-point
implementation to the ground-truth obtained, using Reluplex, by verifying the
real-valued relaxation of the network.


Our results in Tab. 3 present the trend that with higher precision, e.g., 10- 
bits or reals, incomplete methods provide a stable estimate about the robustness 
of the network, i.e., IFGSM was able to find attacks for all non-robust samples. 
However, for lower precision levels, IFGSM missed a substantial number of
attacks, i.e., for the 7-bit network, IFGSM could not find a valid attack for 10%
of the non-robust samples.


5.3 The effect of quantization on robustness 


In Tab. 4 we show how standard accuracy and robust accuracy degrade on our
MNIST classifier when increasing the compression level. The data indicates a
constant discrepancy between standard accuracy and robustness; for
real-numbered networks, a similar fact was already known in the literature [26]: we
empirically confirm that observation for our quantized networks, whose discrepancy
fluctuated between 3 and 4% across all precision levels. Besides, while an
acceptable, larger than 90%, standard accuracy was achieved at 7 bits, an equally
acceptable robustness was achieved at 9 bits.

One relationship not shown in Tab. 4 is that these 4% of non-robust samples
are not the same across quantization levels. For instance, we observed samples
that are robust for the 7-bit network but attackable when quantizing with 9 and
10 bits. Conversely, there are attacks for the 7-bit network that are robust samples
in the 8-bit network.


Precision   6       7       8       9       10      $\mathbb{R}$

Standard    73.4%   91.8%   92.2%   94.3%   95.5%   94.7%
Robust      69.7%   86.5%   88.9%   91.4%   91.4%   91.4%


Table 4: Accuracy of the MNIST classifiers on the 244 test samples for which all
quantization levels could be checked within the given time budget. The column indicated
by $\mathbb{R}$ compares the accuracy of the floating-point implementation to the robust accuracy
of the real-valued relaxation of the network.


5.4 Network specifications beyond robustness 


Concerns have been raised that decisions of an ML system could discriminate 
towards certain groups due to a bias in the training data [2]. A vital issue in 
quantifying fairness is that neural networks are black-boxes, which makes it hard 
to explain how each input contributes to a particular decision. 

We trained a network on a publicly available dataset consisting of 1000 stu- 
dents’ personal information and academic test scores [1]. The personal features 
include gender, parental level of education, lunch plans, and whether the stu- 
dent took a preparation course for the test, all of which are discrete variables. We 
train a predictor for students’ math scores, which is a discrete variable between 
0 and 100. Notably, the dataset contains a potential source for gender bias: the 
mean math score among females is 63.63, while it is 68.73 among males. 

The network we trained is composed of 2 hidden layers with 64 and 32 units,
respectively. We use a 7-bit quantization-aware training scheme, achieving a 
4.14% mean absolute error, i.e., the difference between predicted and actual 
math scores on the test set. 

The network is fair if the gender of a person influences the predicted math
score by at most the bias $\beta$. In other words, checking fairness amounts to verifying
that

$$\Bigl(\bigwedge_{i \neq \mathrm{gender}} s_i = t_i\Bigr) \wedge s_{\mathrm{gender}} \neq t_{\mathrm{gender}} \implies |f(s) - f(t)| \le \beta, \tag{11}$$


is valid over the variables $s$ and $t$, which respectively model two students for
which gender differs but all other features are identical; we call them twin
students. When we encode the dual formula, we encode two copies of the semantics
of the same network: to one copy we give one student $s$ and take the respective
grade $g$, to the other we give its twin $t$ and take grade $h$; precisely, we check
the unsatisfiability of the negation of the formula in Eq. 11. Then, we compute a
tight upper bound for the bias, that is, the maximum possible change in predicted
score for any two twins. To compute the tightest bias, we progressively increase
$\beta$ until our encoded formula becomes unsatisfiable.
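
The search for the tightest bias can be sketched on a toy predictor, with exhaustive enumeration of twin pairs standing in for the satisfiability check. Everything in the sketch is hypothetical (the predictor, the feature domains, and the function names), and it assumes integer-valued predictions so that increasing the bias in steps of 1 is exact.

```python
from itertools import product


def twin_gap_exceeds(predict, feature_domains, gender_index, beta):
    """Return twins whose predictions differ by more than beta, or None.

    A returned pair is a witness for the negation of the fairness formula.
    """
    for s in product(*feature_domains):
        for g in feature_domains[gender_index]:
            if g == s[gender_index]:
                continue
            t = s[:gender_index] + (g,) + s[gender_index + 1:]
            if abs(predict(s) - predict(t)) > beta:
                return s, t
    return None


def tightest_bias(predict, feature_domains, gender_index):
    """Increase beta until no violating twin pair exists any more."""
    beta = 0
    while twin_gap_exceeds(predict, feature_domains, gender_index, beta) is not None:
        beta += 1
    return beta


def toy_predict(s):
    """Hypothetical integer score; feature 0 plays the role of gender."""
    return 60 + 5 * s[0] + 3 * s[1] + 2 * s[0] * s[2]


domains = [(0, 1), (0, 1, 2), (0, 1)]
print(tightest_bias(toy_predict, domains, gender_index=0))
```

The loop stops at the first bias for which no twin pair violates the bound, which is the largest gap between any two twins.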
We measure mean test error and gender bias of the 6- to the 10-bit
quantizations of the network. We show the results in Tab. 5. The test error was stable
between 4.1 and 4.6% among all quantizations, showing that the change in
precision did not affect the quality of the network in a way that was perceivable
by standard measures. However, our formal analysis confirmed a gender bias in
the network, producing twins with a 15 to 21 difference in predicted math score.
Surprisingly, the bias monotonically increased as the precision level in
quantization lowered, indicating to us that quantization plays a role in determining the
bias.


Quantization   Mean         Tightest bias
level          test error   upper bound

6 bits         4.46         22
7 bits         4.14         17
8 bits         4.37         16
9 bits         4.38         15
10 bits        4.59         15


Table 5: Results for the formal analysis of the gender bias of a students' grade
predictor. The maximum gender bias of the network monotonically decreases with
increasing precision.


6 Conclusion 


We introduced the first complete method for the verification of quantized neural 
networks which, by SMT solving over bit-vectors, accounts for their bit-precise 
semantics. We demonstrated, both theoretically and experimentally, that bit- 
precise reasoning is necessary to accurately ensure the robustness to adversarial 
attacks of a quantized network. We showed that robustness and non-robustness 
are non-monotonic in the number of bits for the numerical representation and 
that, consequently, the analysis of high-bits or real-numbered networks may de- 
rive false conclusions about their lower-bits quantizations. Experimentally, we 
confirmed that real-valued solvers produce many spurious results, especially on 
low-bit quantizations, and that also gradient descent may miss attacks. Addi- 
tionally, we showed that quantization indeed affects not only robustness, but 
also other properties of neural networks, such as fairness. We also demonstrated 
that, using our balanced encoding, off-the-shelf SMT-solving can analyze net- 
works with hundreds of neurons which, despite hitting the limits of current 
solvers, establishes an encouraging baseline for future research. 
Abstract. We present a methodology for generating a characterization
of the memory used by an assembly program, as well as a formal proof 
that the assembly is bounded to the generated memory regions. A for- 
mal proof of memory usage is required for compositional reasoning over 
assembly programs. Moreover, it can be used to prove low-level security 
properties, such as integrity of the return address of a function. Our ver- 
ification method is based on interactive theorem proving, but provides 
automation by generating pre- and postconditions, invariants, control- 
flow, and assumptions on memory layout. As a case study, three binaries 
of the Xen hypervisor are disassembled. These binaries are the result 
of a complex build-chain compiling production code, and contain vari- 
ous complex and nested loops, large and compound data structures, and 
functions with over 100 basic blocks. The methodology has been success- 
fully applied to 251 functions, covering 12,252 assembly instructions. 


Keywords: Formal Verification - Assembly - x86-64 - Memory Usage 


1 Introduction 


This paper presents a formal methodology for reasoning over the memory usage
of functions in a software suite. Various security properties require knowledge 
on memory usage. For example, proving absence of buffer overflows requires 
proving that a function does not write outside certain memory regions. Control- 
flow integrity requires showing, among other things, that the return address 
cannot be overwritten [61]. The security property called non-interference requires 
reasoning over which parts of the memory are used by which functions [50]. 
Moreover, memory usage is crucial for compositional reasoning over assembly 
code. Typically, compositional reasoning requires proving that certain code frag- 
ments are spatially independent [45,47]. A proof of memory usage can be used to 
prove such independence, thereby allowing composition. Consider a function g 
that at some point calls function f. Compositional reasoning means that a veri- 
fication effort over f can be reused for verification of g without unfolding it. This 
at least requires that the verification effort over f establishes that f does not 
modify the stack frame of g. More generally, compositional reasoning requires 
at least knowing that f restricts itself to certain parts of the memory. This is
exactly what is established by proving memory usage. 


Memory usage cannot satisfactorily be expressed at the source-code level. As 
an illustration, consider formulating a property that a function cannot overwrite 
its own return address. This requires knowledge on the values of the stack and 
frame pointers, making it an assembly-level property. At the assembly level, one 
can easily express a property formulating that the memory at the top of the 
stack frame (where the return address is stored) should remain unmodified. 


Reasoning over assembly, however, is complicated due to the semantical gap 
between assembly and source code. In assembly code, ostensibly simple com- 
putations can be implemented using complex sequences of low-level operations. 
For example, a simple integer division by 10 can be implemented with a series of 
bit-level operations. Assembly code does not have types. It is common to, e.g., 
mix logical bitwise operators with signed integer arithmetic, or floating-point 
operations with bitvector operations. Assembly code does not have a clear dis- 
tinction between stack frame and heap. Whether some address refers to a local 
variable stored in the stack, a global variable, or part of the heap, is provable 
only by adding assumptions on memory layout. Finally, assembly does not have 
a clear notion of scoping. Function calls are not necessarily clearly delineated, 
and instead of assuming that a function cannot write to a variable it has no 
access to (such as a local variable of another function), this has to be proven. 
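
To make the division example concrete, the following Python sketch models the multiply-and-shift sequence that optimizing compilers commonly emit for unsigned 32-bit division by 10. The constant 0xCCCCCCCD and the shift amount 35 are the well-known values for this divisor; the function name `div10_u32` is chosen here for illustration and does not come from the paper.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def div10_u32(x: int) -> int:
    """Unsigned 32-bit division by 10 without a division instruction."""
    assert 0 <= x < 2**32
    # 0xCCCCCCCD == ceil(2**35 / 10); the product fits in 64 bits.
    product = (x * 0xCCCCCCCD) & MASK64
    # A logical right shift by 35 yields floor(x / 10).
    return product >> 35
```

At the assembly level nothing in this sequence mentions the number 10, which is one reason why recovering the intended arithmetic requires reasoning over bitvector operations.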


The contribution of this paper consists of a formal, compositional and highly 
automated methodology for reasoning over memory usage at the assembly 
level.³ Our approach first uses untrusted tools to generate a formal memory 
usage certificate (see Section 2). This certificate contains (1) theorems on memory 
usage, (2) the preconditions under which memory usage can be shown, and 
(3) proof ingredients. These proof ingredients contain assumptions on memory 
layout, control-flow information, and invariants. Section 2 provides an example 
of a function that theoretically can overwrite its own return address. We show 
that the certificate provides preconditions and a formal proof that a return- 
address-based exploit is not possible under those preconditions. 


The certificate and the original assembly are loaded into an interactive the- 
orem prover (ITP). Memory usage in general is an undecidable property (Rice’s 
theorem [48]), which is why we aim for an ITP environment to allow user in- 
teraction when necessary. Using the proof ingredients, the certificate is formally 
proven correct with minimal user interaction, making use of customized proof 
strategies. Section 3 describes certificate verification and composition. 

To demonstrate applicability and scalability, we apply the methodology to 
x86-64 binaries of the Xen hypervisor [13] (see Section 4). The binaries are ob- 
tained via the standard Xen build process, including optimizations. The binaries 
are decompiled to assembly using off-the-shelf disassembly tools. Our method- 
ology is applied to 251 functions; for each function a certificate is automatically 
generated, and a proof is finished in the Isabelle/HOL theorem prover [44]. Without 
exception, the manual interaction consists of elementary interactive theorem 
proving such as applying the proper proof method. 

3 All code and proofs are publicly available [57]. 

While past work [38,41,25] on assembly-level formal verification exists, the 
degree of either scalability or automation is limited. As an example of interactive 
theorem proving, Boyer and Yu verified machine-code implementations of various 
standard sort and string functions, requiring over 19,000 lines of manually 
written proof code for the verification of roughly 900 instructions [8]. As an example 
of automated theorem proving, Tan et al. presented an approach which takes 
about 6 hours for a 533-instruction string search algorithm [56]. In contrast, this 
paper requires roughly 85 lines of manually written proof code per 1,000 
lines of assembly. Our work is able to verify, almost fully automatically, 12,252 
instructions from real-world industrial binaries compiled by a real-world build 
process. Section 5 discusses prior art, its contrast with this paper's work, and 
this paper's contributions. To the best of our knowledge, there is no related work 
that achieves similar scalability and automation on real-world binaries. 


2 Formal Memory Usage Certificates 


Figure 1 provides an example of a formal memory usage certificate (FMUC). 
The FMUC is generated automatically from an assembly file. This assembly file 
may be produced from a binary using a disassembler such as objdump, IDA,⁴ 
Ghidra's decompiler,⁵ or Capstone [46]. In case source code is available, the 
assembly code can also be produced directly by a compiler. In this example, 
the C code of Figure 1a is used solely for presentation; the input to the FMUC 
generation is the assembly created by decompiling the corresponding binary. For 
each function in the assembly file, an FMUC is produced. External functions, 
for example due to dynamic linking, are treated as black boxes (see Section 3.4). 

An FMUC consists of two parts: a memory usage theorem and its proof (see 
Figure 1c). The theorem consists of assumptions implying a Hoare triple [28,40] 
over the function. The Hoare triple is specific to memory usage. Intuitively, 
it means that from a state satisfying precondition P, after execution of code 
fragment f, the state satisfies postcondition Q (as in normal Hoare triples). The 
Hoare triple also contains a memory region set M. Besides its regular meaning, 
the Hoare triple expresses that any write that occurs during execution of f occurs 
within one of the memory regions in this set. 

The term memory usage formally denotes an overapproximation of the memory 
written to by a function. Thus, any address that is not enclosed in one of the 
regions of M is guaranteed to be preserved. Set M, however, will also include 
the memory regions read by the function, for verification purposes. 

The precondition P expresses that the instruction pointer rip is at the entry 
point of the function. It also provides initial symbolic values for all registers and 
memory regions that are read (e.g., $\mathit{rsp} = \mathit{rsp}_0$). Finally, it formulates that 
the return address is stored at the top of the stack frame. The postcondition Q 
expresses that the function has returned, i.e., the instruction pointer is equal to 
the return address and the stack pointer rsp is equal to its original value plus 
eight. For any callee-saved register, i.e., any register whose value is assumed to 
be preserved by the function call, it will say that its value is unchanged. 


4 https://www.hex-rays.com/products/ida/index.shtml 
5 https://ghidra-sre.org/ 


    /* declaration of is_even; garbled in source */
    int main(int argc, char* argv[]) {
      int* a = (int*)argv;
      int* b = (int*)(argv + 4);
      *(int*)(argv + 2) = *a + *b;
      *(char*)argv = 'a';

      int array[argc];

      for (int i = 0; i < argc; i++) {
        array[i] = argv[i][0] * 2;
      }

      if (is_even(argc)) {
        return array[argc];
      }

      return array[0];
    }

(a) C Code 


    Block 1149→120b;
    Loop
      Block 123e→1244;
      If SF ≠ OF Then
        Block 120d→123a
      Else Break Fi
    Pool;
    Block 1246→1249;
    Block 124b→124b;  -- call to is_even
    Block 1250→1252;
    If ZF Then
      Block 1263→1267
    Else Block 1254→1261 Fi;
    Block 1269→1279;
    If ZF Then
      Block 1280→1285
    Else Block 127b→127b Fi

(b) Syntactic Control Flow f 


    thm: $\mathit{MRR} \Longrightarrow \{P\}\ f\ \{Q; M\}$
    proof:
      apply (check_scf_step)+
      apply (check_scf_while "P_123e || P_[illegible]")
      apply (check_scf_step)+
    where:
      $P = (\mathit{rip} = \mathtt{1149} \land \mathit{rsp} = \mathit{rsp}_0 \land \ldots \land {*[\mathit{rsp}_0, 8]} = \mathit{ret\_addr})$
      $Q = (\mathit{rip} = \mathit{ret\_addr} \land \mathit{rsp} = \mathit{rsp}_0 + 8 \land \ldots \land {*[\mathit{rsp}_0, 8]} = \mathit{ret\_addr})$

(c) Theorem and proof code 


    $M = \{ a = [\mathit{rsp}_0, 8],\ b = [\mathit{fs}_0 + 40, 8],\ c = [\mathit{rsi}_0 + 36, 4],\ d = [\mathit{rsp}_0 - 8, 8],\ \ldots \}$
    $\mathit{MRR} = \{a, b, c, d, \ldots\}$ are separate

(d) The memory regions and their relations for block 123e→1244. 


$$\begin{aligned}
P_{\mathtt{123e}}(\sigma) ={} & \mathit{rip} = \mathtt{123e} \\
\land{} & \mathit{rbp} = \mathit{rsp}_0 - 8 \\
\land{} & \mathit{rdi} = \mathit{rdi}_0 \\
\land{} & \mathit{rsp} = \mathit{rsp}_0 - (88 + 16 * ((15 + 4 * \mathit{sextend}(\langle 31,0\rangle \mathit{rdi}_0)) / 16)) \\
\land{} & {*[\mathit{rsp}_0 - 40, 8]} = \mathit{rsp}_0 - (85 + 16 * ((15 + 4 * \mathit{sextend}(\langle 31,0\rangle \mathit{rdi}_0)) / 16)) \gg 2 \ll 2 \\
\land{} & {*[\mathit{rsp}_0 - 48, 8]} = \mathit{sextend}(\langle 31,0\rangle \mathit{rdi}_0) - 1 \\
\land{} & {*[\mathit{rsp}_0 - 56, 8]} = \mathit{rsi}_0 + 32
\end{aligned}$$

(e) Invariant at line 0x123e (only 7 out of 23 equations shown) 


$$\{P_{\mathtt{124b}}\}\ |\mathit{is\_even}|\ \{P_{\mathtt{1250}};\ M_{\mathit{is\_even}}\}$$

(f) Assumption due to call of function is_even 


Fig. 1: An FMUC. Region $[a, s]$ denotes a region of $s$ bytes starting at 64-bit 
address $a$. Notation $*r$ denotes reading region $r$ in little-endian fashion. Notation 
$\langle 31, 0\rangle \mathit{rdi}_0$ takes the lower 32 bits of the register. 


The component f of the memory usage theorem is a representation of the 
control flow of the function in terms of syntactic structures such as basic blocks, 
loops and if-then-else statements (see Figure 1b). We call this the syntactic 
control flow (SCF). The SCF is automatically generated from the control flow 
graph (CFG). A syntactic structure is required because the 
proof is done using Hoare logic, which is guided by syntax. The proof of an FMUC 
of an entire function is based on FMUCs per basic block. Thus one FMUC is 
generated per basic block, and one corollary FMUC for the entire function. 


The proof consists of two further proof ingredients: memory region relations 
and invariants. We zoom in on block 123e→1244 to explain both of these. The 
FMUC provides 13 regions for this block, of which 4 are shown (see Figure 1d). 
Region a stores the return address. Region b depends on the segment register 
fs and stores the canary [15]. Region c is based on the pointer passed as second 
argument to the function. Finally, region d is part of the stack frame. The gener- 
ated memory region relations assume that all these regions are separate. Out of 
the per-block memory regions and their relations, memory regions and relations 
for the function as a whole are composed. 


For each basic block, an invariant is generated. Stronger invariants can lead 
to a tighter approximation of memory usage. The invariant assigned to block 
123e→1244 is effectively a loop invariant (see Figure 1e). The frame pointer 
rbp is equal to the original stack pointer minus eight. Register rdi has not 
been touched. We also show some of the more complex invariants, such as the 
value of the stack pointer. In total, the loop invariant provides information on 
11 registers and 12 memory locations for this basic block. Note that the FMUC 
provides preconditions in terms of the initial state of the corresponding basic 
block. In Section 3.2 these are lifted to preconditions in terms of the initial state 
of the function. 


For this example, we treated is_even as an external function (see Figure 1f). 
An assumption was thus generated that expresses that the memory usage of that 
function suffices to show that the invariant at line 124b implies the invariant at 
line 1250. This means, among other things, that the memory used by is_even (denoted 
$M_{\mathit{is\_even}}$) should not overlap with regions a through d. Section 3.4 provides more 
information on composition. 


The FMUC is generated automatically, except for the three-line proof in Figure 1c. 
Due to the undecidability of memory usage, interaction may be required. 
Isabelle/HOL proof strategies are provided to assist in that interaction. Section 3 
provides more details. The manual effort required in proving the FMUC 
for this function consists simply of calling the proper proof strategies. First, 
check_scf_step is run, applying Hoare logic rules and proving correctness of the 
memory usage until the loop. Then, the proof strategy for dealing with the loop 
is called, with the invariant generated from the FMUC. Finally, check_scf_step 
is called again, which is able to verify the remainder of the function. 

Finally, note that without any assumptions the function could overwrite its 
own return address at various places. The memory region relations MRR are 
sufficiently strong to exclude this. These relations thus form the preconditions 
under which a return-address exploit is impossible. As an example, they assume that 
regions a and c are separate. This means that the address stored in parameter 
argv (reflected as $\mathit{rsi}_0$ at the assembly level) is not allowed to point to a region 
within the stack frame of function main. 

Due to space restrictions, we omit details on the algorithms that generate an 
FMUC. In general, none of the FMUC generation is part of the trusted comput- 
ing base. That is, none of the algorithms need to be backed up by formal proofs. 
The output of the FMUC generation is imported into Isabelle/HOL, where it is 
proven correct. If there is an error in CFG generation, control flow extraction, 
symbolic execution, or in the generated invariants, then the certificate cannot 
be proven in Isabelle/HOL. One exception is the memory region relations. They 
are assumptions, and if they are internally inconsistent this leads to a vacuous 
truth. For that reason, Z3 is used to generate them [39], making it impossible to 
introduce, e.g., a relation where two overlapping regions are considered separate. 
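
The consistency concern above can be illustrated on concrete addresses. The sketch below is not the Z3-based generation used in the paper; it only checks, for one assumed concrete value of the initial stack pointer, whether a set of regions claimed to be separate really is pairwise non-overlapping. The names `separate` and `pairwise_separate` and the sample addresses are hypothetical, and address wrap-around at 2^64 is ignored.

```python
from itertools import combinations


def separate(r0, r1):
    """Regions are (address, size) pairs; separate means no shared byte."""
    (a0, s0), (a1, s1) = r0, r1
    return a0 + s0 <= a1 or a1 + s1 <= a0


def pairwise_separate(regions):
    """True if every pair of regions in the list is separate."""
    return all(separate(r0, r1) for r0, r1 in combinations(regions, 2))


rsp0 = 0x7FFF0000                      # assumed concrete initial stack pointer
ok = [(rsp0, 8), (rsp0 - 8, 8)]        # return address and an adjacent stack slot
bad = [(rsp0, 8), (rsp0 - 4, 8)]       # second region overlaps the return address
```

Assuming that the regions in `bad` are separate would be an inconsistent assumption, from which any memory usage theorem would follow vacuously.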


3 FMUC Verification 


This section presents the verification of an FMUC. Both the FMUC and the 
original assembly are loaded into Isabelle/HOL. The theorem is then proven 
using the proof ingredients stored in the FMUC. This means that given a step 
function that models the semantics of the assembly instructions, the Hoare triple 
is verified. 

Let $\mathit{step} :: I \times S \times S \mapsto \mathbb{B}$ be a transition relation. It takes as input an 
instruction of type $I$ and two states $\sigma$ and $\sigma'$. It returns true if and only if 
execution of the instruction in state $\sigma$ can produce state $\sigma'$. Undefined behavior, 
such as null-pointer dereferencing, is modeled by relating a state to any successor 
state. The semantics of a syntactic control flow (SCF) are straightforwardly 
defined by a function $\mathit{exec\_scf} :: \mathit{SCF} \times S \times S \mapsto \mathbb{B}$ (here $\mathit{SCF}$ denotes the type 
of a syntactic control flow object). In case of loops the function is defined using 
a least fixed point construction. This way, if the halting condition is never met, 
there exists no related $\sigma'$. 

First, we define the notion of memory usage with respect to a certain state change: 


Definition 1. The set of memory regions $M$ is the memory usage with respect to the state 
change from $\sigma$ to $\sigma'$ if and only if any byte at an address $a$ not inside one of 
the regions is unchanged. 

$$\mathit{usage}(M, \sigma, \sigma') \equiv \forall a \cdot (\forall r \in M \cdot [a,1] \bowtie r) \Longrightarrow \sigma' : {*[a,1]} = \sigma : {*[a,1]}$$

Here, notation $\sigma : {*[a,s]}$ means reading in little-endian fashion $s$ bytes from 
memory address $a$ in state $\sigma$. Notation $r_0 \bowtie r_1$ denotes that two regions are 
separate. 
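
As an informal illustration of Definition 1, the following Python sketch evaluates the usage predicate on two concrete memory snapshots. It is a toy model, not the Isabelle/HOL definition: memory is assumed to be a dictionary from addresses to byte values (absent addresses read as 0), regions are (address, size) pairs, and the names `inside` and `usage` are chosen here for illustration.

```python
def inside(a, region):
    """True if byte address a lies within region = (base, size)."""
    base, size = region
    return base <= a < base + size


def usage(M, before, after):
    """Every byte outside all regions of M has the same value before and after."""
    addresses = set(before) | set(after)
    return all(
        before.get(a, 0) == after.get(a, 0)
        for a in addresses
        if not any(inside(a, r) for r in M)
    )


before = {0x100: 1, 0x108: 2}
after = {0x100: 9, 0x108: 2}   # only the byte at 0x100 changed
```

With these snapshots, `usage([(0x100, 8)], before, after)` holds, while `usage([(0x108, 8)], before, after)` does not, because the changed byte lies outside the claimed region.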
Definition 2. A memory usage Hoare triple is defined as: 

$$\{P\}\ f\ \{Q; M\} \equiv \forall \sigma\, \sigma' \cdot P(\sigma) \land \mathit{exec\_scf}(f, \sigma, \sigma') \Longrightarrow Q(\sigma') \land \mathit{usage}(M, \sigma, \sigma')$$


In words, Definition 2 states the following: if precondition P holds on the 
initial state $\sigma$ and $\sigma'$ can be obtained by executing f, postcondition Q holds on 
the produced state and the values stored in all memory regions outside set M 
are preserved. 
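
Definition 2 can be checked by brute force on a finite toy model. The sketch below is only an illustration under strong simplifying assumptions: a state is a 4-byte memory whose bytes are 0 or 1, the code fragment `f` is a hypothetical single store, and `exec_scf` is modeled as a function returning the list of successor states. None of these names correspond to the Isabelle/HOL formalization.

```python
from itertools import product


def usage(M, before, after):
    """Bytes outside every region (base, size) of M are unchanged."""
    def inside(a):
        return any(base <= a < base + size for base, size in M)
    return all(before[a] == after[a] for a in range(len(before)) if not inside(a))


def hoare_triple(P, exec_scf, Q, M, states):
    """Check {P} f {Q; M} by enumerating a finite state space."""
    return all(
        Q(s2) and usage(M, s, s2)
        for s in states if P(s)
        for s2 in exec_scf(s)
    )


STATES = list(product((0, 1), repeat=4))  # all 4-byte memories over {0, 1}


def f(state):
    """Toy code fragment: store 1 at address 2; returns all successor states."""
    mem = list(state)
    mem[2] = 1
    return [tuple(mem)]


def always(state):
    return True


def byte2_is_one(state):
    return state[2] == 1
```

The triple holds for the region set `[(2, 1)]` and fails for `[(3, 1)]`, since the store at address 2 then falls outside the claimed memory usage.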


3.1 Verification Tools Used 


Isabelle/HOL. The theorem prover utilized in this work was Isabelle 2018 [44]. 
It is a generic tool with a flexible, extensible syntactic framework. Isabelle also 
provides a powerful proof language known as intelligible semi-automated reasoning 
(Isar) [59] and a proof strategy language called Eisbach [37]. We made heavy 
use of the Word library [17]. This library provides a limited-precision integer type, 
'a word, where 'a is the number of bits in the integer. Various operations are 
provided for manipulation of and arithmetic involving formal words, including bit 
indexing, bit shifting, setting specific bits, and signed and unsigned arithmetic. 
Operators for inequality are also included, as well as operations for converting 
between word sizes. 

Machine Model and Instruction Semantics. Heule et al. provide semantics 
of the x86-64 architecture [27]. Instead of manually codifying instruction 
semantics, they applied machine learning to derive semantics from a live x86 
machine. This produced highly reliable semantics: they compared the seman- 
tics to manually written semantics based on the Intel reference manuals, and 
found that in the few cases where they differed the Intel manuals were wrong. 
Roessle et al. embedded these semantics into the Isabelle/HOL theorem prover 
and tested the formal Isabelle semantics against live x86 hardware [49]. This 
formal machine model is the base of our verification effort. 

Symbolic Execution. Bockenek et al. provide an Isabelle/HOL symbolic 
execution engine based on the above semantics [6]. Effectively, this provides a 
function symb_exec that symbolically runs basic blocks. Let $a_0$ and $a_1$ be the 
start and end addresses of the block. A call to $\mathit{symb\_exec}(a_0, a_1, \sigma, \sigma')$ returns 
true if and only if state $\sigma'$ is the result of symbolically executing the block from 
state $\sigma$. The symbolic execution is completely written in Isabelle/HOL, meaning 
that every rewrite rule has been formally proven correct. 


3.2 Per-block Verification 


Verification occurs by first verifying per basic block. Figure 2a shows an introduc- 
tion rule for establishing a Hoare triple over a basic block. The first assumption 
requires the symbolic execution method to run over a universally quantified symbolic 
state $\sigma$ that satisfies the precondition. Any resulting state $\sigma'$ should satisfy 
the postcondition Q, and the set of memory regions M generated for the block 
should be correct. 
The second assumption is required because of an important subtlety: the 
regions generated in the FMUC are expressed in terms of the initial state of 
their basic block. However, it makes no sense to express the regions used by 
individual blocks within a larger function in terms of their own initial state. If a 
region of a basic block somewhere within a function body depends on, e.g., the 
value of register rdi at the start of that block, then it is unsound to express that 
memory region in terms of $\mathit{rdi}_0$, i.e., the value of rdi at the start of the function. 
Therefore, the Hoare triples are defined based on a set of memory regions $M'$ 
that solely depends on the initial state of the function. For each block, that set is 
obtained by taking the generated set of memory regions M (expressed in terms 
of the initial state of the block) and applying it to any state that satisfies the 
current invariant. This produces a set of regions expressed in terms of the initial 
state of the function. 

An Isabelle proof strategy has been implemented that, given the proof ingre- 
dients from the FMUC, discharges this introduction rule. The proof strategy runs 
symbolic execution within Isabelle/HOL, proves the postcondition and proves 
the memory usage. The open variables P, Q, $a_0$, $a_1$ and M are all provided by 
the FMUC. No interaction is required; for basic blocks the proof is automated. 


3.3 Verification of Function Body 


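$$\frac{\begin{array}{c} \forall \sigma\, \sigma' \cdot P(\sigma) \land \mathit{symb\_exec}(a_0, a_1, \sigma, \sigma') \Longrightarrow Q(\sigma') \land \mathit{usage}(M(\sigma), \sigma, \sigma') \\ M' = \{ r \mid \exists \sigma \cdot P(\sigma) \land r \in M(\sigma) \} \end{array}}{\{P\}\ \mathtt{Block}\ a_0 \rightarrow a_1\ \{Q; M'\}}$$

(a) Introduction rule 


$$\frac{\{P\}\ f\ \{Q; M_1\} \qquad \{Q\}\ g\ \{R; M_2\} \qquad M = M_1 \cup M_2}{\{P\}\ f; g\ \{R; M\}}$$

(b) Sequence rule 


$$\frac{\{I \land B\}\ f\ \{I; M\} \qquad I \land \lnot B \Longrightarrow Q}{\{I\}\ \mathtt{While}\ B\ \mathtt{DO}\ f\ \mathtt{OD}\ \{Q; M\}}$$

(c) While rule 


Fig. 2: Hoare rules for memory usage 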


For each syntactic construct, a Hoare rule is defined (see Figure 2). The 
sequence and conditional rules (only the former is shown) are straightforward: the 
memory usage is the union of the memory usage of the constituents. Note that 
the sequence rule is sound only because the memory predicates are independent 
of the initial state of the basic blocks, as discussed above. 

The while rule is based on a loop invariant $I$. If the memory usage of one 
iteration of function body f is constrained to the set of memory regions M, then 
that holds for the entire loop. This sounds counterintuitive. Consider a simple C-like 
loop iterating from $i = 0$ while $i < 10$ and as body the assignment a[i] = 0, 
i.e., it writes to the $i$th element of an array. Verification of the loop requires 
the invariant $I(\sigma) = i(\sigma) < 10$. The FMUC of the loop body will have a set of 
memory regions $M(\sigma) = \{[a + i(\sigma), 1]\}$, i.e., one region of one byte, expressed in 
terms of the initial state of the basic block. Now consider the application of the 
introduction rule to the block of the loop body. It will introduce a Hoare triple 
with: 

$$\begin{aligned}
M' &= \{ r \mid \exists \sigma \cdot I(\sigma) \land r \in M(\sigma) \} \\
   &= \{ r \mid \exists \sigma \cdot i(\sigma) < 10 \land r = [a + i(\sigma), 1] \} \\
   &= \{ [a', 1] \mid a \le a' < a + 10 \}
\end{aligned}$$

The set $M'$ is actually the memory used by the entire loop. This is because 
the introduction rule applies the state-dependent set of memory regions to any 
state that satisfies the invariant. This shows that the strength of the generated 
invariants influences the tightness of the overapproximation of memory usage. A 
weaker invariant, e.g., $i < 20$, would produce a larger set of memory regions. 
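
The computation of the lifted region set for this loop example can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions: a state is reduced to the (non-negative) value of the loop counter i, the array base address `A` is a hypothetical constant, and the state space is a finite range so that the existential quantifier can be enumerated.

```python
def lifted_regions(invariant, regions_of, states):
    """M' = { r | exists state . invariant(state) and r in regions_of(state) }."""
    return {r for s in states if invariant(s) for r in regions_of(s)}


A = 0x1000              # hypothetical base address of the array
STATES = range(0, 64)   # a state is just the value of i in this toy model


def body_regions(i):
    """M(state) = { [a + i, 1] }: the single byte written by one iteration."""
    return {(A + i, 1)}


tight = lifted_regions(lambda i: i < 10, body_regions, STATES)
loose = lifted_regions(lambda i: i < 20, body_regions, STATES)
```

Here `tight` contains exactly the ten one-byte regions of the array, whereas the weaker invariant yields the strictly larger set `loose`.
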
An Isabelle/HOL proof strategy is implemented that automatically applies 
the proper Hoare logic rule. It is driven by the syntactic control flow provided 
by the FMUC. For function bodies without loops, this proof strategy requires 
no further interaction. For each loop entry, it is required to manually apply the 
weaken rule to show that the postcondition of the block before entry implies the 
loop invariant. Without exception, each of these proofs could be finished using 
standard off-the-shelf Isabelle/HOL tools. The part that is usually the most 
involved, defining the invariants, is taken care of by the FMUC generation. 


3.4 Composition 


Let f be a function body. Assume that the function has been verified, i.e., a Hoare 
triple has been proven of the form $\{P_f\}\ f\ \{Q_f; M_f\}$. In order to compositionally 
reuse that verification effort, function f is considered to be a black box once it 
is verified. Now consider a function g calling function f: 


a0: push rbp 
al: call f 
a2: pop rbp 
a3: ret 


Let P denote the precondition right before executing the assembly instruction 
call. Precondition P contains the equality $*[\mathit{rsp}_0 - 8, 8] = \mathit{rbp}_0$, expressing 
that function g has pushed frame pointer rbp into its own local stack frame. Let 
Q denote the postcondition just after returning, but before executing pop. The 
postcondition of g expresses that callee-saved register rbp is properly restored, 
i.e., $\mathit{rbp} = \mathit{rbp}_0$. That is indeed done by the pop instruction. In order to prove 
proper restoration of rbp, it must be proven that function f did not overwrite 
any byte in region $[\mathit{rsp}_0 - 8, 8]$. Additionally, function f must be proven not to 
overwrite region $[\mathit{rsp}_0, 8]$, which stores the return address of g. For this particular 
instance of calling f, it thus must be proven that f preserves these two regions. 
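
The role of the two preserved regions can be made tangible with a small simulation of the four instructions of g. The Python sketch below is a toy model only: memory is a dictionary of 8-byte words keyed by address, the callee is an arbitrary Python function that may write to memory and clobber rbp, and the address `A2` as well as the callee names are hypothetical.

```python
A2 = 0xA2  # hypothetical address of the instruction after `call f`


def run_g(rsp0, rbp0, ret_addr, callee):
    """Execute push rbp; call f; pop rbp; ret on a word-addressed toy memory."""
    mem = {rsp0: ret_addr}          # return address of g at the top of its frame
    rsp, rbp = rsp0, rbp0
    rsp -= 8; mem[rsp] = rbp        # a0: push rbp, so *[rsp0 - 8, 8] = rbp0
    rsp -= 8; mem[rsp] = A2         # a1: call f pushes the return address a2
    rbp = callee(mem, rsp)          #     f may write memory and change rbp
    rsp += 8                        #     the ret of f pops a2
    rbp = mem[rsp]; rsp += 8        # a2: pop rbp
    return rbp, mem[rsp]            # a3: ret jumps to the word at rsp


def well_behaved(mem, rsp):
    mem[rsp - 8] = 0xDEAD           # writes only below its own return address
    return 0                        # clobbers rbp


def clobbering(mem, rsp):
    mem[rsp + 8] = 0xBAD            # overwrites region [rsp0 - 8, 8] of g
    return 0
```

With `well_behaved` as callee, g restores rbp and returns to its own return address; with `clobbering`, the popped value of rbp is no longer the original one, which is exactly what the preservation requirement on f rules out.
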
More generally, function f can be called by various functions other than g. 
For each call the specific requirements on which memory regions are required to 
be preserved differ. Thus, to be able to verify function f once, and reuse that 
verification effort for each call, the verification effort must at least contain an 
overapproximation of the memory written to by function f. Note that this is 
exactly the requirement when using separation logic [45,47,33]. Separation logic 
provides a frame rule for compositional reasoning. This frame rule informally 
states that if a program can be confined to a certain part of a state, properties 
of this program carry over when the program is part of a bigger system. 

We thus provide a version of the frame rule of separation logic, specific to 
memory usage verification (see Figure 3). Effectively, this rule is used to prove 
that the memory usage of a caller function g is equal to the memory it uses 
itself, plus the memory used by function f. It requires four assumptions. First, 
it assumes function f has been verified for memory usage, with $M_f$ denoting that 
memory usage. Second, it assumes that precondition P can be split up into two 
parts: precondition $P_f$ required to verify function f, and a separate part $P_{\mathit{sep}}$. 
The separate part is specific to the actual call of the function. In the example, 
$P_{\mathit{sep}}$ will contain the equality $*[\mathit{rsp}_0 - 8, 8] = \mathit{rbp}_0$. Third, the correctness of the 
set of memory regions $M_f$ should suffice to prove that the separated part $P_{\mathit{sep}}$ 
is preserved. In the example, this effectively means that $M_f$ should not overlap 
with the two regions of g. Fourth, $P_{\mathit{sep}}$ and $Q_f$ should imply postcondition Q. 


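$$\frac{\begin{array}{c} \{P_f\}\ f\ \{Q_f; M_f\} \\ P \Longrightarrow P_f \land P_{\mathit{sep}} \\ \forall \sigma\, \sigma' \cdot \mathit{usage}(M_f, \sigma, \sigma') \land P_{\mathit{sep}}(\sigma) \Longrightarrow P_{\mathit{sep}}(\sigma') \\ Q_f \land P_{\mathit{sep}} \Longrightarrow Q \end{array}}{\{P\}\ \mathtt{Call}\ f\ \{Q; M_f\}}$$


Fig. 3: Frame rule for composition of memory usage 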


In practice, many functions will not be part of the assembly code under veri- 
fication (e.g., external calls). We thus have to generate the assumptions required 
to proceed with verification. To this end, we introduce the following notation: 


$$\{P\}\ |f|\ \{Q; M_f\} \equiv \exists P_f\, Q_f\, P_{\mathit{sep}} \cdot \text{the four assumptions of the frame rule are satisfied}$$


Making this assumption informally expresses that function f is assumed to have 
been verified. Its memory usage $M_f$ is assumed to suffice to prove that we could 
step from states satisfying P to states satisfying Q. 


4 Case Study: Xen Project 


The Xen Project [13] is a mature, widely-used virtual machine monitor (VMM), 
also known as a hypervisor. Hypervisors provide a method of managing multiple 
virtual instances of operating systems (called guests or domains) on a physical 
host. The Xen hypervisor is a suitable case study because of its security rele- 
vance and its complex build process involving real production code. Security is a 
significant issue in environments where hypervisors are used, such as the Ama- 
zon Elastic Compute Cloud (Amazon EC2), Rackspace Cloud, and many other 
cloud service providers. For example, when one or more physical hosts support 
virtual guests for any number of distinct users, ensuring isolation of the guest 
operating systems (OSs) is important. The Xen build process produces multi- 
ple binaries that contain functions not present in the Xen source itself. This is 
due to the inclusion of external static libraries and programs. We used Xen 4.12 
compiled with GCC 8.2 via the standard Xen build process. This build process 
uses various optimization levels, ranging from O1 to O3. 


Of the binaries produced by the Xen build process, we considered three: 
xenstore, xen-cpuid, and qemu-img-xen. The xenstore binary is involved in 
the functionality of XenStore,⁶ a hierarchical data structure shared amongst 
all Xen domains. The xen-cpuid utility queries the underlying processors and 
displays information about the features they support. The third binary, qemu- 
img-xen, consists of over three hundred functions that are not present in the Xen 
source code. It provides some of the functionality of Quick Emulator (QEMU). 
QEMU is a free, open-source emulator.⁷ Xen uses it to emulate device models 
(DMs), which provide an interface for hardware storage. 


Binaries        Function Count      Instruction Count   Loops   Manual Lines of Proof 
                (verified/total) 
xenstore        2/6                 100                 0       6 
xen-cpuid       2/3                 210                 2       39 
qemu-img-xen    247/343             11,942              64      1,002 
Total           251/352             12,252              65      1,047 

Fig. 4: Case Study Overview 


6 https://wiki.xen.org/wiki/XenStore 
7 https://www.qemu.org/ 
Our methodology is currently capable of dealing with 71% of the functions 
present in these binaries (see Figure 4). The supported features include (nested) 
loops, subcalls, variable argument lists, jumps into other function bodies, and string 
instructions with the rep prefix. There is no particular limit on function size. 
The average number of instructions per function analyzed is 49. Some of the 
functions analyzed have over 300 instructions and over 100 basic blocks. 

There are five categories of features we do not support. The first and most 
common is indirection, accounting for 19%. Indirection involves a call or jump 
instruction that loads the target address from a register or memory location 
rather than using a static value. Switch statements and certain uses of goto are 
the most common causes of indirect jumps. Indirect calls generally result from 
usage of function pointers. For example, the main functions of all three verified 
binaries used switch statements in loops in the process of parsing command line 
options. These statements introduced indirect branches. 

The second category involves issues related to generating the memory region 
relations. This step requires solving linear arithmetic over symbolically computed 
addresses. Sometimes, addresses are computed using a combination of arithmetic 
operators with bitwise logical operators. In some of these cases, our translation 
to Z3 does not produce an answer. As an example, function qcow_open uses 
the rotate-left function to compute an address. As another example, function 
AES_set_encrypt_key produces addresses that are obtained via combinations 
of bit-shifting, bit masking, and xor-ing. 

The instruction repz cmps is currently not supported for technical reasons. It 
is the assembly equivalent of the function strncmp, but instead writes its result to 
a flag. Various other string-related instructions with the rep prefix are supported. 
Functions with recursion, a minority in systems code, are also not supported. 
Recursive stack frames in our framework are not well-suited to automation. 
The two recursive functions we encountered both perform file-system-like tasks. 
Functions do_chmod and do_ls are similar, respectively, to the permission-setting 
chmod utility and the directory-displaying ls utility. The final category is functions whose 
SCF explodes. The issue occurs mostly when loops have multiple entries. 

The table in Figure 4 provides an overview of the verification effort. The 
table shows the absolute counts of functions verified as well as the total number 
of instructions for those functions. Alongside that information is the number of 
functions with loops that were verified and how many manual lines of proof were 
required in total. The vast majority of those manual proof lines were related to 
the loop count. 


5 Related Work 


Assembly verification has been an active research field for decades. Table 1 provides 
a non-exhaustive overview of related work. We first address some formal 
verification efforts at the assembly level. Then we discuss work in which assem- 
bly verification played a role in a larger verification context. Finally, verified 
compilation and static binary analysis tools are discussed. 
Assembly-level Verification. Clutterbuck et al. [14] performed formal ver- 
ification of assembly code using SPACE-8080, a verifiable subset of the Intel 8080 
instruction set architecture (ISA) that is analyzable and formally verifiable [12]. 
Not long after, Bevier et al. presented a systems approach to software verification 
[5,7]. Their work laid out a methodology for verifying the correctness of all com- 
ponents necessary to execute a program correctly, including compiler, assembler 
and linker. The methodology was applied to a small OS kernel, Kit [4]. Similarly, 
Yu and Boyer [60,8] presented operational semantics and mechanized reasoning 
for approximately 80% of the instructions of the MC68020 microprocessor, over 
85 instructions. Their approach utilized symbolic execution of operational se- 
mantics. These early efforts required significant interaction. For example, the 
approach of Yu and Boyer required over 19,000 lines of manually written proof 
to verify approximately 900 assembly instructions. 


Matthews et al. targeted a simple machine model called TINY as well as 
Java virtual machine (JVM) bytecode using the M5 operational model [38]. Their 
approach utilizes symbolic execution of code annotated with manually written 
invariants. It also used verification condition generation to increase automa- 
tion. This reduced the number of manually written invariants. Both of these 
assembly-style languages feature a stack for handling scratch variables rather 
than a register file as x86, ARM, and most other mainstream ISAs do. 


Goel et al. presented an approach for modeling and verifying non-deterministic 
programs on the binary level [25,24]. In addition to formulating the semantics of 
most user-mode x86 instructions, they provided semantics for common system 
calls. System call semantics increase the spread of programs that can be fully 
verified. Their work was applied to multiple small case studies, including a word 
count program and two kernel-mode memory copying examples. 


Bockenek et al. provide an approach to proving memory usage over x86 
code [6]. They used a Floyd-style reasoning framework to prove Floyd invari- 
ants over functions [21]. They have applied it to functions of the HermitCore 
unikernel, covering 2,613 assembly instructions. Their approach required a sig- 
nificant amount of manual effort: pre- and postconditions, invariants, the actual 
regions of memory used and their relations all need to be manually defined. 


The main difference between these existing approaches and the methodol- 
ogy presented in this paper concerns automation. Generally, interactive theorem 
proving over semantics of assembly instructions does not scale due to the amount 
of intricate user interaction involved. Figure 1e shows, for example, the complexity of
defining an assembly-level invariant even for a small example. Fully automated 
approaches to formal verification, however, do not scale either. The recent au- 
tomated approach AUSPICE takes about 6 hours for a 533-instruction string 
search algorithm [56]. To the best of our knowledge, our methodology is the first 
that is able to deal with optimized x86-64 binaries produced by production code, 
with a “manual effort vs. instruction count ratio” of roughly 1 to 11. 

Myreen et al. developed decompilation-into-logic [40,41,42]. That work,
developed in the HOL4 theorem prover [54], uses operational semantics of machine
code to lift programs into a functional form. That functional form can then be
used in a Hoare logic framework for program analysis [40]. Decompilation-into-logic
has been used for both ARM and x86 ISA machine models, and applied
to various large examples, including benchmarks such as a garbage collector
and the Skein hash function. Decompilation-into-logic formally covers the
gap between machine code and a HOL model. It is not a verification method in
itself, i.e., it does not verify properties over the machine code. It can be used as
a component in a binary-level verification methodology [51].

Table 1: Overview of Related Work.

Work                  Target        Approach   Applications        Verified code
Clutterbuck & Carré   SPACE-8080    ITP        N/A
Bevier et al.         PDP-11-like   ITP        Kit
Yu & Boyer            MC68020       ITP        String functions    863 insts
Matthews et al.       Tiny/JVM      ITP+VCG    CBC enc/dec         631 insts
Goel et al.           x86-64        ITP        word-count          186 insts
Bockenek et al.       x86-64        ITP        HermitCore          2,613 insts
Tan et al.            ARMv7         ATP        String search       983 insts
Myreen et al.         ARM/x86       DiL        seL4                9,500 SLoC
Feng et al.           MIPS-like     ITP        Example functions
This paper            x86-64        ITP+CG     Xen                 12,252 insts
Sewell et al.         C             TV+DiL     seL4                9,500 SLoC
Shi et al.            C/ARM9        ATP+MC     ORIENTAIS           8,000 SLoC, 60 insts
Dam et al.            ARMv7         ATP+UC     PROSPER             3,000 insts

VCG = Verification Condition Generation    DiL = Decompilation-into-Logic
SLoC = Source Lines of Code                ATP = Automated Theorem Proving
UC = User Contracts                        CG = Certificate Generation
TV = Translation Validation                MC = Model Checking

Feng et al. presented stack abstractions for modular verification of assembly 
code [20,19]. Their work allows for integration of various proof-carrying code 
systems [43]. As with our work, it utilizes a Hoare-style framework for its veri- 
fication. The authors applied their work to multiple example functions, such as 
two factorial implementations. In contrast to our approach, manual annotations
are required to provide information regarding invariants and memory layout. 

Integrated Assembly-Level Verification Efforts. A major verification 
effort, based on decompilation-into-logic, is the verification of the seL4 kernel
[32,31]. The seL4 project provides a microkernel written in formally proven
correct C code. The tool AutoCorres [26] is used for C code verification. Sewell
et al. verified a refinement relation between the C source code and an ARM
binary, both non-optimized and optimized at O2 [51]. The major difference
with respect to our work is that our methodology targets existing production
code, instead of code written with verification in mind. For example, the seL4 
source code does not allow taking the addresses of stack variables (such as in 
Figure 1a): their approach requires a static separation of stack and heap. Neither 
the seL4 proof effort nor our methodology supports function pointers.

Shi et al. formally verified a real-time operating system (RTOS) for
automotive use called ORIENTAIS [52]. Part of their approach involved source-level
verification using a combination of Hoare logic and abstract communicating se- 
quential processes (CSP) model analysis [29]. Binary verification was done by 
lifting the RTOS binary to xBIL, a related hardware verification language [53]. 
They translated requirements from the OSEK automotive industry standard to 
source code annotations. 


Targeting a similar case study as this paper, Dam et al. formally verified a 
tiny ARMv7 hypervisor, PROSPER [16,3] at the assembly level. Their methodol- 
ogy integrated HOL4 with the Binary Analysis Platform (BAP) [9]. BAP utilizes 
a custom intermediate language that provides an architecture-agnostic represen- 
tation of machine instructions and their side effects. HOL4 was used to translate 
the ARM binary into BAP’s intermediate language, using the formal model of 
the ARM ISA by Fox et al. [22]. The SMT solver Simple Theorem Prover (STP)
[23] was used to determine the targets of indirect branches and to discharge the 
generated verification conditions. While the approach was generally automated, 
user input was still required to describe software contracts of the hypervisor. 


Verified Compilation. In contrast to directly verifying machine or assem- 
bly code, one can verify source code and then use verified compilation. Verified 
compilation establishes a refinement relation between assembly and source code. 
The CompCert project [36] provides a compiler for a subset of C. Its output has 
been verified to have the same semantics as the C source code. The seL4 project 
used CompCert to reduce its trusted code base [31]. Another example of verified 
compilation is CakeML [35]. It utilizes a subset of Standard ML modeled with 
big-step operational semantics. The main purpose of verified compilation, how- 
ever, is not to verify properties over the code. For example, if the source code is 
vulnerable to a return-address exploit, then the assembly code is vulnerable as 
well. Verified compilation is thus often accompanied by source code verification. 
We have argued that for memory usage, assembly-level verification is necessary. 


Static Analysis. Static analysis of binary code has been an active research 
field for decades [34,9,58]. The BitBlaze project [55] provides a tool called Vine 
which constructs control flow graphs for supplied programs and lifts x86 instruc- 
tions to its own intermediate language (IL). Though Vine itself is not formally 
verified, it does support interfacing with the SMT solver STP as well as CVC 
[1,2]. The tool Infer [10], developed at Facebook, provides in-depth static analy- 
sis of LLVM code to detect bugs in C and C++ programs. It utilizes separation 
logic [47] and bi-abduction [11] to perform its analyses in an automated fashion. 
It is designed to be integrated into compiler toolchains, in order to provide im- 
mediate feedback even in continuous integration scenarios. FindBugs is a static 
analysis tool for Java code [30]. Rather than relying on formal methods, it uses 
searches for common code idioms to detect likely bugs. Common errors it high- 
lights include null pointer dereferences, objects that compare equal not having 
equal hash codes, and inconsistent synchronization. The tool Splint [18] detects 
buffer overflows and similar potential security flaws in C code. It relies on
annotated preconditions to derive postconditions.

The main difference between these static analysis tools and formal verification
is that these tools are generally well suited to finding bugs, but are not able
to prove their absence. They generally apply techniques that are formally
unsound, such as depth-bounded searches. 


6 Conclusion 


This paper presents an approach to formal verification of memory usage of func- 
tions in a compiled program. Memory usage is a property that expresses an 
overapproximation of the memory used by assembly code. Memory usage is fun- 
damental to compositional verification of assembly code, as compositionality at 
least requires proving that functions do not unexpectedly interfere with each
other's stack frames. It can also be used to show security-related properties, such
as integrity of the return address. 

Our approach automatically generates a formal memory usage certificate that 
includes 1.) a set of memory regions read from and written to, 2.) postconditions 
that express sanity constraints over the function (e.g., the return address has not 
been overwritten, callee-saved registers are restored), 3.) proof ingredients such 
as the preconditions necessary for formal verification. The certificate is loaded 
into a theorem prover, where it is verified. Since the problem of memory usage 
is undecidable, we use an interactive theorem prover. The proof ingredients, 
combined with custom proof strategies, provide a large degree of automation. 
They deal with memory aliasing, the control flow of the function, and invariants. 

The approach is applied to three binaries of the Xen hypervisor. These bina- 
ries contain production code and are the result of a complex build chain. They 
contain, among others, various nested loops, large and compound data struc- 
tures, variadic functions, and both in- and external function calls. For 71% of 
the functions of these binaries, a certificate could be generated and verified. For 
each of these functions, it has at least been formally proven that the return ad- 
dress is not overwritten. The amount of user interaction is roughly 85 lines of 
proof code per 1,000 lines of assembly code. The greatest bottleneck is in indirect 
branching, which accounts for 19% of the functions. 

In the near future we aim to support indirect branching. This would allow 
support of switches, callbacks, and pointers to functions. Additionally, we aim to 
strengthen the invariant generation. Stronger invariants lead to a tighter overap- 
proximation of memory usage. The challenge here is not only to generate these 
invariants, but to automate their proof as well. Finally, we want to leverage the 
certificate to target high-level security properties, such as noninterference.

Abstract. We present the main concepts, components, and usage of
GASOL, a Gas AnalysiS and Optimization tooL for Ethereum smart con- 
tracts. GASOL offers a wide variety of cost models that allow inferring 
the gas consumption associated to selected types of EVM instructions 
and/or inferring the number of times that such types of bytecode in- 
structions are executed. Among others, we have cost models to measure 
only storage opcodes, to measure a selected family of gas-consumption 
opcodes following the Ethereum's classification, to estimate the cost of 
a selected program line, etc. After choosing the desired cost model and 
the function of interest, GASOL returns to the user an upper bound of 
the cost for this function. As the gas consumption is often dominated 
by the instructions that access the storage, GASOL uses the gas analysis 
to detect under-optimized storage patterns, and includes an (optional) 
automatic optimization of the selected function. Our tool can be used 
within an Eclipse plugin for Solidity which displays the gas and instruc- 
tions bounds and, when applicable, the gas-optimized Solidity function. 


1 Introduction and Main Applications 


Ethereum [27] is a global, open-source platform for decentralized applications 
that has become the world's leading programmable blockchain. Like other
blockchains, Ethereum has a native cryptocurrency named Ether. Unlike other
blockchains, Ethereum is programmable using a Turing-complete language, i.e.,
developers can code smart contracts that control digital value, run exactly as
programmed, and are immutable. A smart contract is basically a collection of code
(its functions) and data (its state) that resides at a specific address on the 
Ethereum blockchain. Smart contracts on the Ethereum blockchain are metered 
using gas. Gas is a unit that measures the amount of computational effort that 
it will take to execute each operation. Every single operation in Ethereum, be it
a transaction or a smart contract instruction execution, requires some amount of
gas. The gas consumption of the Ethereum Virtual Machine (EVM) instructions 
is spelled out in [27]; importantly, instructions that use replicated storage are 
gas-expensive. Miners get paid an amount in Ether which is equivalent to the 
total amount of gas it took them to execute a complete operation. The rationale 
for gas metering is threefold: (i) Paying for gas at the moment of proposing the 
transaction prevents the emitter from wasting miners' computational power by
requiring them to perform worthless intensive work. (ii) Gas fees discourage
users from consuming too much replicated storage, which is a valuable resource
in a blockchain-based consensus system (this is why storage bytecodes are gas- 
expensive). (iii) It puts a cap on the number of computations that a transaction 
can execute, hence prevents DoS attacks based on non-terminating executions. 
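
Point (iii) can be illustrated with a toy interpreter, sketched here in Python. This is not the EVM: the instruction names, the per-instruction costs in `COST`, and the helper `run` are assumptions made for this sketch only. It merely shows how charging gas for every executed instruction forces a non-terminating loop to halt.

```python
class OutOfGas(Exception):
    """Raised when the gas budget is exhausted before the program halts."""


# Illustrative per-instruction costs (assumed for this sketch, not the EVM fee schedule).
COST = {"ADD": 3, "JUMP": 8, "STOP": 0}


def run(program, gas_limit):
    """Execute a list of (opcode, arg) pairs, charging gas before each step.

    Returns the gas used if the program reaches STOP.
    """
    gas_left = gas_limit
    pc = 0
    while True:
        opcode, arg = program[pc]
        cost = COST[opcode]
        if cost > gas_left:
            raise OutOfGas(f"needed {cost}, only {gas_left} left at pc={pc}")
        gas_left -= cost
        if opcode == "STOP":
            return gas_limit - gas_left
        # JUMP sets the program counter to its argument; everything else falls through.
        pc = arg if opcode == "JUMP" else pc + 1


# A terminating program and an infinite loop (JUMP back to instruction 0).
terminating = [("ADD", None), ("ADD", None), ("STOP", None)]
infinite_loop = [("ADD", None), ("JUMP", 0)]
```

Running `infinite_loop` with any finite `gas_limit` ends in `OutOfGas`, whereas `terminating` returns the gas it consumed.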

Solidity [13] is the most popular language to write Ethereum smart contracts 
that are then compiled into EVM bytecode. The Solidity compiler, solc, is able 
to generate only constant gas bounds. However, when the bounds are parametric 
expressions that depend on the function parameters, on the contract state, or on 
the blockchain state (according to the experiments in [8] this happens in almost 
10% of the functions), solc returns ∞ as gas bound. This paper presents
GASOL [6], a resource analysis and optimization tool that is able to infer para- 
metric bounds and optimize the gas consumption of Ethereum smart contracts. 
GASOL takes as input a smart contract (either in EVM, disassembled EVM, or 
in Solidity source code), a selection of a cost model among those available in 
the system (see Section 2), and a selected public function, and it automatically
infers cost upper bounds for this function. Optionally, the user can enable the
gas optimization option (see Section 3) to optimize the function w.r.t. storage
usage, a highly valuable resource. GASOL has a wide range of applications: (1) 
It can be used to estimate the gas fee for running transactions, as it soundly 
over-approximates the gas consumption of functions. (2) It can be used to cer- 
tify that the contract is free of out-of-gas vulnerabilities, as our bounds ensure 
that if the gas limit paid by the user is higher than our inferred gas bounds, 
the contract will not run out of gas. (3) From an attacker's perspective, one can
estimate how much Ether (in gas) an adversary has to pour into a contract in order
to execute an out-of-gas attack. Also, attacks have been mounted by introducing a
very large number of underpriced bytecode instructions [23]. Our cost models could
help detect this second type of attack by measuring how many instructions will
be executed (a very large number) while their associated gas consumption
remains very low. (4) As we will show in the paper, the gas analysis can be used 
to detect gas-expensive fragments of code and automatically optimize them. 


2 Gas Analysis using Gasol 


Fig. 1. Overview of GASOL’s components

Figure 1 overviews the components of the GASOL tool [6]. The programmer
can use GASOL during the software development process from its Eclipse plugin,
which allows selecting the cost model of interest and the function to be analyzed
and/or optimized from the Outline. This selection, together with the compiled
EVM code, is sent to the gas analyzer. A technical description of all phases
that comprise a gas analysis for EVM smart contracts is given in [8]. Basically,
the analyzer uses various tools [3,7] to extract the CFGs and decompile them 
into a high-level representation from which upper bounds (UB) are produced by 
using extensions of resource analyzers and solvers [4,5]. However, in our basic 
gas analyzer named GASTAP [8], there was only one cost model to compute the 
overall gas consumption of the function (including the opcode and memory gas 
costs [27]), while GASOL is an extension of GASTAP that introduces optimization, 
a wide variety of analysis options to define novel cost models, and an Eclipse 
plugin. The UBs are provided to the user in the console as well as in markers 
for functions within the Eclipse editor. If the user has selected the optimization
option, the analyzer detects potential sources of optimization and feeds them to 
the optimizer to generate an optimized Solidity function within a new file. 

Fig. 2 displays our Eclipse plugin that contains a fragment of the public 
smart contract ExtraBalToken [1] used as running example. We can see its six 
state variables and its function fill that we will analyze and optimize. The right 
side window shows GASOL’s configuration options to set up the cost model: 

(i) Type of resource (gas/instructions): by selecting gas, we estimate the gas 
consumption according to the gas model in [27] (hence, use GASOL as a gas ana- 
lyzer); by selecting instructions, we estimate the number of bytecode instructions 
executed (using GASOL as a standard complexity analyzer). 

(ii) Type of instructions: allows selecting which instructions (or groups of
instructions) will be measured, as follows.

- All: every bytecode instruction will be measured. For instance, by selecting gas
in (i), the function fill, and this option, we obtain as gas bound: 1077 + 40896·data.
Besides, by using this option, GASOL also yields the so-called memory gas
(see [27]): 3·(data+5) + ⌊(data+5)²/512⌋. The analyzer abstracts arrays by their length;
hence, these bounds are functions of the length of the input array (denoted as
data) and can be used, e.g., to determine precisely how much gas is necessary
to run a transaction that executes this function.

- Gas-family: [27] classifies bytecode instructions according to the gas they consume
into six groups: zero, base, verylow, low, mid and high. Instructions that do not
belong to any of the previous groups are considered as single families. This option 
provides the cost due to each gas-family separately and, by using the filter in (iii), 
we can type the name of the desired group(s). As an example, for the function 
fill using gas in (i), we obtain gas bounds 297 + 315·data and 16 + 8·data for
the gas-families verylow and mid, resp.

Fig. 2. Excerpt of smart contract ExtraBalToken in Solidity within Eclipse plugin.


- Storage: only the instructions that access the storage (namely bytecodes SLOAD 
and SSTORE) are accounted. The gas bounds displayed within the Eclipse console 
in Fig. 2 correspond to this setting, where we can see that the gas due to the 
access of each basic storage variable is shown separately. The first row unknown 
accumulates the gas of all accesses to non-basic types (data structures) as we 
still cannot identify them. By comparing this storage gas with the overall gas 
bound shown above for All, we can observe that most of the gas consumed by 
the function is indeed dominated by the storage (namely 40,000 out of 40,896 at
each loop iteration) and it is thus a target for optimization (see Sec. 3). 

- Storage-optimization: it bounds the number of SLOAD and SSTORE instructions 
executed by the current function (excluding those in transitive calls). It is the 
cost model that is used to detect and carry out the optimization described in 
Sec. 3. Thus, it is the only selection that enables the Gas optimization that ap- 
pears as third option, and forces the selection of “instructions” as type of resource 
in (i). We obtain for the state variable totalSupply the bound 2·data, which
captures that we execute two accesses (one read, one write) to field totalSupply at
each loop iteration.

- Line: this option allows specifying the line number (of the Solidity program) 
whose cost will be measured, and the remaining lines will be filtered out. For 
instance, if the line number specified in the filter (iii) is 17, i.e., the Solidity 
instruction: uint amount = data[i]/D160, the obtained gas bound is 34-97-data.
In the absence of number in the filter, the bounds are given separately for all 
program lines. This option is intended to help the programmer in improving the 
gas consumption of her code by trying out different implementation options and 
comparing the results. 

- Selected: allows computing the consumption associated to each different EVM 
instruction separately. For instance, if we select the bytecode instructions MLOAD 
and SHA3, we obtain the gas bounds 6+15·data and 84·data, resp. As in the
previous option, the filter allows the user to select the instructions of interest 
and filter out the remaining ones.

(iii) Filter: this is a text field used to filter out information from the UBs. For
gas-family, the user can specify low, mid, etc. For storage, it allows specifying the 
name of the basic field(s) whose storage will be measured. For line and selected, 
we can type the line numbers and names of bytecode instructions of interest. 
Once all options have been selected, we have set up a cost model that is sent 
together with the EVM code to the gas analyzer and, after analysis, it outputs 
an UB for the selected function w.r.t. the cost model activated by the options. 
This UB is displayed, as shown in Fig. 2 in the console of the Eclipse plugin, 
and also within markers next to the function definition. 
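
To make the bounds quoted in this section concrete, the following Python sketch evaluates them for a given input length. The function names are hypothetical and not part of GASOL. The coefficients are the ones reported above for fill under the option All, the memory gas uses the linear-plus-quadratic form of the memory cost in [27], and adding the two to obtain a total is an assumption of this sketch.

```python
def opcode_gas_bound(data_len):
    """Upper bound on the opcode gas of fill (option All): 1077 + 40896*data."""
    return 1077 + 40896 * data_len


def memory_gas_bound(data_len):
    """Memory gas of fill: 3*(data+5) + floor((data+5)^2 / 512)."""
    a = data_len + 5
    return 3 * a + (a * a) // 512


def storage_share_per_iteration():
    """Fraction of the per-iteration gas due to storage (40,000 out of 40,896)."""
    return 40000 / 40896


def enough_gas(gas_limit, data_len):
    """True if gas_limit covers opcode gas plus memory gas for this input length."""
    return gas_limit >= opcode_gas_bound(data_len) + memory_gas_bound(data_len)
```

For an input array of length 10, `opcode_gas_bound` gives 410037 and `memory_gas_bound` gives 45, so under these assumptions any gas limit of at least 410082 passes `enough_gas`.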


3 Gas Optimization using Gasol 


The information yielded by the gas analysis is used in GASOL to detect potential
optimizations. Currently, the optimization target is the reduction of the gas con- 
sumption associated to the usage of storage. In particular, we aim at replacing 
multiple accesses to the same (global) storage data within a fragment of code 
(each write access costs 20,000 in the worst case and 5,000 in the best case) by
one access that copies the data in storage to a (local) memory position followed 
by accesses to such memory position (an access to the local memory costs only 
3) and a final update to the storage if needed. The cost model number of in- 
structions for storage-optimization described in Sec. 2 allows us to detect such 
storage optimizations, namely for each different field, if we get a bound that is 
different from one, we know that there may be multiple accesses to the same po- 
sition in the storage and we try to replace them by gas-efficient memory accesses. 
Our transformation is done at the level of the Solidity code, by defining a local 
variable with the same name as the state variable to transform, and introduc- 
ing setter and getter functions to access the storage variable. Currently, we can 
transform accesses to variables of basic types; in the future, we plan to extend
it to data structures (maps and arrays). The number-of-instructions bound for
field totalSupply is 2·data (hence ≠ 1), and our optimization of fill is:

    function fill(uint[] data) {
        uint256 totalSupply = get_field_totalSupply();

        if ((msg.sender != owner) || (sealed))
            throw;
        for (uint i = 0; i < data.length; i++) {
            address a = address(data[i] & (D160 - 1));
            uint amount = data[i] / D160;
            if (balanceOf[a] == 0) {
                balanceOf[a] = amount;
                totalSupply += amount;
            }
        }
        set_field_totalSupply(totalSupply);
    }
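
The effect of this transformation on storage accesses can be mimicked outside the EVM. The Python sketch below is an illustration only: `CountingStorage`, `fill_original`, and `fill_optimized` are hypothetical stand-ins for the contract storage and the two versions of fill. The access-control check and the balanceOf test are omitted, so every iteration updates the field, and the counters play the role of the storage-optimization cost model.

```python
class CountingStorage:
    """Key-value store that counts reads (SLOAD-like) and writes (SSTORE-like) per field."""

    def __init__(self):
        self.values = {"totalSupply": 0}
        self.reads = {}
        self.writes = {}

    def load(self, field):
        self.reads[field] = self.reads.get(field, 0) + 1
        return self.values[field]

    def store(self, field, value):
        self.writes[field] = self.writes.get(field, 0) + 1
        self.values[field] = value


def fill_original(storage, amounts):
    # Every iteration reads and writes the storage field directly.
    for amount in amounts:
        storage.store("totalSupply", storage.load("totalSupply") + amount)


def fill_optimized(storage, amounts):
    # One read into a local variable, local updates, one final write.
    total_supply = storage.load("totalSupply")
    for amount in amounts:
        total_supply += amount
    storage.store("totalSupply", total_supply)


def accesses(fill, amounts):
    """Run one version of fill and return (reads, writes, final value) for totalSupply."""
    storage = CountingStorage()
    fill(storage, amounts)
    return (storage.reads.get("totalSupply", 0),
            storage.writes.get("totalSupply", 0),
            storage.values["totalSupply"])
```

For an input of length n, the first version performs n reads and n writes on totalSupply, while the second performs one of each regardless of n; both leave the same final value.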


The gas bound (using the option All) for the optimized fill yielded by GASOL is
21368 + 20674·data, which means that, assuming the worst case for write access
to storage, the gas consumed inside the loop is 49.45% smaller than the one for
the original fill function (the memory gas does not change). Note that, even if
we consider the best case of 5,000 for write access to storage for the accesses we
have optimized, the gas reduction is still around 20%. This is, in fact, what we
have manually estimated using the actual data of the 82 times this function has 
been executed in the Ethereum blockchain, achieving with GASOL a total saving 
of almost 60M gas. As our transformation is local to the function, in order to 
be sound, we check that the transformed global data is not being accessed by
transitive calls. For instance, if there was a call to another function from function
fill that accesses totalSupply, we would not transform it. Besides, for efficiency, 
we check if all accesses are read (bytecode SLOAD) and, in such case, we do not 
need to invoke the setter at the end (and avoid an unnecessary write access). 
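
The reductions quoted in this section follow from the per-iteration coefficients of the two bounds. The Python sketch below redoes the arithmetic. The best-case figure rests on an assumption that is ours and not stated in the text: exactly one optimized write per iteration is priced at the best-case cost instead of the worst-case cost in the original function.

```python
ORIGINAL_PER_ITERATION = 40896   # coefficient of data in the original bound
OPTIMIZED_PER_ITERATION = 20674  # coefficient of data in the optimized bound
WORST_CASE_WRITE = 20000         # storage write cost, worst case
BEST_CASE_WRITE = 5000           # storage write cost, best case


def reduction(original, optimized):
    """Relative gas reduction inside the loop, as a fraction of the original cost."""
    return (original - optimized) / original


worst_case = reduction(ORIGINAL_PER_ITERATION, OPTIMIZED_PER_ITERATION)

# Assumption: in the best case, the one write removed from the loop would have
# cost 5,000 instead of 20,000, so the original loop body is cheaper by 15,000.
best_case_original = ORIGINAL_PER_ITERATION - (WORST_CASE_WRITE - BEST_CASE_WRITE)
best_case = reduction(best_case_original, OPTIMIZED_PER_ITERATION)

print(f"worst case: {worst_case:.2%}")
print(f"best case:  {best_case:.2%}")
```

Under these assumptions the worst-case reduction is just under one half and the best-case reduction is roughly one fifth, in line with the figures stated above.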


4 Related Tools and Conclusions 


Numerous tools are being developed to catch different types of vulnerabilities of 
smart contracts [20,16,22,19,17,26,18,10,15,9]. As mentioned in Sec. 1, the Solid- 
ity compiler solc is not able to give any gas estimation for the running example, 
as its gas consumption is not constant. Therefore, new gas analysis tools are be- 
ing developed to detect potential gas related vulnerabilities and to infer bounds 
in these complex situations. The purpose of the GASPER and MADMAX tools is 
precisely the detection of gas related vulnerabilities. MADMAX [14] focuses on 
identifying control- and data-flow patterns inherent for the gas-related vulnera- 
bilities, thus, it works as a bug-finder, rather than as a gas analyzer like GASOL. 
Similarly, GASPER identifies gas-costly programming patterns [12] by matching 
specific control-flow patterns and using SMT solvers and symbolic computation. 
Thus, it is an optimization detector, not an automatic optimizer as GASOL. The 
recently developed ebso tool [24] also aims at optimizing the gas consumption 
of EVM code. In contrast to GASOL, ebso's optimizations are limited to a basic 
block level, while our transformation might involve several blocks of the CFG 
and would not be achievable by ebso's approach. Also, ebso is not guided by 
the results of an automatic resource analysis which can capture the expensive 
storage patterns as in our case. Instead it is based on a full exploration of all 
possible alternative instructions (within the considered block) that would lead to 
the same result and consume less gas. They have obtained a number of rewrite 
rules that define sequences of bytecode instructions that can be replaced by 
equivalent ones that consume less. We could easily incorporate such basic block 
replacement optimizations within our tool, and it is part of our agenda. 

The approach of [21], like ours, aims at inferring precise gas bounds. Their 
approach is based on symbolically enumerating all execution paths [11] and 
unwinding loops to a limit. Instead, using resource analysis, GASOL infers the 
maximal number of iterations for loops and generates accurate gas bounds which 
are valid for any possible execution of the function and not only for the unwound 
paths. The approach by Marescotti et al. has not been implemented in the con- 
text of EVM and a tool like GASOL has not been delivered. An orthogonal line of 
work with ours is the construction of resource-oriented attacks [23] that exploit 
the weaknesses of the EVM gas model. GASOL's cost models could help detect
these resource-oriented attacks by estimating the number of executed bytecode
instructions (very high) and their associated gas consumption (very low). 

Finally, there is a tendency to define new languages (see Scilla [25], Michelson 
[2]) for programming smart contracts that provide certain safety guarantees, e.g., 
Scilla [25] provides predictable gas consumption by disallowing general recursion 
and while-loops. However, Ethereum is today the most widely used blockchain, 
and Solidity the most popular programming language to write Ethereum smart 
contracts, for which a gas analyzer+optimizer is of clear relevance.

Abstract. Verification algorithms are among the most resource-intensive
computation tasks. Saving energy is important for our living environment 
and to save cost in data centers. Yet, researchers still compare the efficiency of
algorithms in terms of consumption of CPU time (or even wall time).
Perhaps one reason for this is that measuring energy consumption of 
computational processes is not as convenient as measuring the consumed 
time and there is no sufficient tool support. To close this gap, we contribute 
CPU Energy Meter, a small tool that takes care of reading the energy
values that Intel CPUs track inside the chip. In order to make energy
measurements as easy as possible, we integrated CPU Energy Meter into
BenchExec, a benchmarking tool that is already used by many researchers
and competitions in the domain of formal methods. As evidence for 
usefulness, we explored the energy consumption of some state-of-the-art 
verifiers and report some interesting insights, for example, that energy 
consumption is not necessarily correlated with CPU time. 


Keywords: Energy Measurement · RAPL · Benchmarking · BenchExec


1 Introduction 


There is a strong demand to save electrical energy, of which nowadays a large
portion is used by computational processes. Most importantly, we need to protect 
the environment that we live in, but we also need to consider that energy usage 
is one of the most important cost factors in data centers: after computing devices 
are purchased and installed, the operational cost is dominated by the cost of 
consumed electrical energy. And since most of the used electrical energy is turned 
into heat energy, there is follow-up cost for the cooling system, which sets the 
limits of used energy for each rack in a data center [16]. 

In order to control energy consumption, we first need to measure it. Work in 
the area of green software engineering identified a lack of data and insufficient 
tool support [12]. Energy consumption of an algorithm is often reduced to CPU 
time, which seems to be a natural choice at a first look, but after more accurate 
measurement we know that this reduction leads to wrong conclusions. 

Why is energy usage of verification algorithms not measured but only CPU 
time? Most likely it is technically too difficult for researchers to measure energy 
consumption, because it would require external hardware that is not common or 
because internal energy measurements are not well-known and complex to use. 


© The Author(s) 2020 
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 126-133, 2020.

In order to provide a solution to this problem, we contribute an open-source
lightweight tool that enables convenient energy measurement for a large range of 
modern CPUs. The tool CPU ENERGY METER makes it easy and convenient to 
access energy measurements done by the CPU for several of its parts. Furthermore,
we integrate energy measurement in the benchmarking framework BENCHEXEC, 
which is widely used by researchers and competitions (e.g., [2]). 

Using CPU ENERGY METER does not require any extra hardware, but accesses 
the existing feature for energy measurement called RAPL that Intel CPUs provide. 
This convenience comes with a limitation: We can only access measurement values 
for those parts of the computing board that the CPU measures, but not for external 
equipment, such as hard drives and the power supply itself. 


Related Work. Energy measurements should be used for algorithm engineer- 
ing [1], and there is a strong need for tool support, such as PowerPack [8]. RAPL 
is being studied as a measurement method for energy consumption [6, 9, 10, 13, 17], 
and energy measurements that are based on RAPL are being developed for specific 
scenarios [11, 15, 18, 19] and used to evaluate algorithms [7]. CPU ENERGY METER 
makes energy measurement conveniently accessible to verification researchers. 
The most closely related project is the Performance API (PAPI) analysis library, 
which also supports RAPL [19], but this is a large library with a much larger 
scope than just energy measurements. In contrast, our tool is a ready-to-use 
solution for energy measurements that is easy to install and use. 


2 Intel Running Average Power Limit (RAPL) 


The Intel Running Average Power Limit (RAPL) [14] is a feature of Intel CPUs 
that allows measuring and limiting the energy consumption of CPUs. It has been available 
since the 2nd generation of the Intel Core architecture (code name “Sandy Bridge”), 
i.e., on Intel Core i3/i5/i7 2000 and newer, as well as Intel Xeon E3/E5/E7 CPUs. 
This covers a wide range of common CPUs for notebooks, desktops, and servers. 

One part of RAPL consists of access to a series of hardware counters in which 
the CPU accumulates the energy it has consumed. RAPL supports measuring the 
energy consumption of so-called “domains”, and up to five domains are supported 
by current CPUs: package, PP0, PP1, DRAM, and PSYS. Which hardware units 
are included in which domain is not clearly specified by Intel, but in general we 
can use the following assumption: The package domain refers to the whole CPU, 
the PP0 domain refers to the processor cores, and the PP1 domain refers to other 
units such as an integrated graphics unit. The domains DRAM and PSYS may 
provide information on the energy consumption of the RAM and other hardware 
on the mainboard, but both need special support from the hardware platform 
and their values may not be comparable between different systems. 

There is no official information by Intel on the precision of the measurements 
except that the counters are updated approximately every 1 ms. The resolution 
of the values varies between the CPUs, but is typically $\frac{1}{2^{14}}$ J or $\frac{1}{2^{16}}$ J, i.e., in 
the order of $10^{-5}$ J. For the first generation of CPUs with RAPL, the energy 
consumption was approximated by the CPU and imprecise, but for subsequent 
generations the precision has been improved [6, 7, 10]. 
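
To make the stated resolution concrete, the following minimal sketch converts a raw counter value into Joules. It assumes (this is not spelled out in the text above) that the CPU reports its energy unit as an exponent, here called `esu`, so that one counter increment equals 1/2^esu J; the function names are illustrative only.

```python
# Sketch: convert a raw RAPL energy counter value to Joules.
# Assumption: one counter increment equals 1 / 2**esu Joule, where the
# exponent `esu` is reported by the CPU (typical values: 14 or 16).

def energy_unit_joules(esu: int) -> float:
    """Size of one counter increment in Joules."""
    return 1.0 / (1 << esu)


def raw_to_joules(raw: int, esu: int) -> float:
    """Convert a raw counter value (or a difference of two) to Joules."""
    return raw * energy_unit_joules(esu)


if __name__ == "__main__":
    for esu in (14, 16):
        # Both units are in the order of 1e-5 J.
        print(f"esu={esu}: one increment = {energy_unit_joules(esu):.3e} J")
```

Both exponents give an increment in the order of 10^-5 J, which matches the resolution mentioned above.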
3 CPU ENERGY METER 


Our tool CPU ENERGY METER provides access to the energy-measurement features 
of Intel CPUs to users. It was developed based on the tool Intel Power Gadget for 
Linux¹. Our tool is available as open source under the permissive 2-clause BSD 
license and hosted on GitHub². Installation packages of CPU ENERGY METER 
are available for Debian-based distributions (e.g., Ubuntu). 

CPU ENERGY METER measures the energy consumption of the CPU(s) of a 
system for a specific time interval as reported by the RAPL interface (cf. Sect. 2). 
In order to ensure the highest possible measurement precision with the lowest 
possible overhead, it reads the RAPL energy counters as rarely as possible instead 
of using continuous sampling, while at the same time reading the counters often 
enough to safely detect and account for counter overflows. Furthermore, our 
tool was developed to use a minimal amount of necessary dependencies and 
permissions in order to make its installation as easy as possible. 
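
The idea of reading the counters rarely, but often enough to detect overflows, can be sketched as follows. This is an illustration and not the implementation of CPU ENERGY METER: the 32-bit counter width, the class name, and the 100 W power bound in the example are assumptions made here.

```python
# Sketch: accumulate energy from a wrapping hardware counter.
# Assumptions: the counter is COUNTER_BITS wide and wraps to 0 on overflow,
# and at most one wraparound happens between two consecutive reads.

COUNTER_BITS = 32  # assumed counter width


class EnergyAccumulator:
    def __init__(self, first_raw: int, bits: int = COUNTER_BITS):
        self._modulus = 1 << bits
        self._last = first_raw
        self.total_raw = 0

    def update(self, raw: int) -> int:
        """Add the increments since the previous read and return the total."""
        # The modular difference is correct across a single wraparound.
        self.total_raw += (raw - self._last) % self._modulus
        self._last = raw
        return self.total_raw


def max_read_interval_seconds(bits: int, esu: int, max_power_watts: float) -> float:
    """Longest interval between reads that cannot hide a second wraparound."""
    counter_range_joules = (1 << bits) / (1 << esu)
    return counter_range_joules / max_power_watts


if __name__ == "__main__":
    acc = EnergyAccumulator(first_raw=2**32 - 10)
    acc.update(5)  # the counter wrapped around between the two reads
    print(acc.total_raw)
    # Hypothetical upper bound of 100 W for the measured domain:
    print(max_read_interval_seconds(32, 14, 100.0))
```

Under these assumptions, the read interval only needs to be shorter than the time in which the domain could consume the full counter range, which is why infrequent reads suffice.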


Requirements. CPU ENERGY METER requires a system with one or more Intel 
CPUs that support the RAPL feature. It needs direct access to the CPUs, thus 
running in a virtual machine is not supported. Accessing the model-specific regis- 
ters of CPUs with the energy measurements is done via the Linux kernel module 
msr³, which needs to be loaded and provides device files named /dev/cpu/*/msr. 
Typically, access to these device files is granted only to the user root. In order 
to not need to execute CPU ENERGY METER as root, one can change the file 
permissions of the device files appropriately (e.g., by granting read permissions to 
a group msr and making CPU ENERGY METER always execute as this group using 
the “setgid” permission). Furthermore, CPU ENERGY METER needs the capability 
CAP_SYS_RAWIO⁴, which can be granted using setcap⁵. The installation packages 
of CPU ENERGY METER attempt to automatically configure the system such 
that every user can execute the tool without granting any other non-standard 
permissions to users. In any case (whether executed as root or not), CPU ENERGY 
METER drops all unnecessary permissions as soon as possible using the library 
“libcap”⁶ in order to reduce any risk related to the non-standard permissions. 
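
For illustration, reading a model-specific register through such a device file could look like the sketch below. It is not the code of CPU ENERGY METER (which is not written in Python). The register addresses and the bit layout are assumptions taken from our reading of the Intel manual [14] and the msr man page, and the script needs the permissions described above to work on a real system.

```python
# Sketch: read the package energy counter via the Linux msr device file.
# Assumptions (not from the text above): register addresses and bit layout
# as documented by Intel; reads are 8 bytes at an offset equal to the
# register number, as described in the msr(4) man page.
import os
import struct

MSR_RAPL_POWER_UNIT = 0x606    # assumed address of the unit register
MSR_PKG_ENERGY_STATUS = 0x611  # assumed address of the package counter


def read_msr(path: str, register: int) -> int:
    """Read one 64-bit register value from an msr device file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from {path} at {register:#x}")
    return struct.unpack("<Q", data)[0]


def package_energy_joules(path: str = "/dev/cpu/0/msr") -> float:
    """Return the current package counter value converted to Joules."""
    units = read_msr(path, MSR_RAPL_POWER_UNIT)
    esu = (units >> 8) & 0x1F  # assumed position of the energy-unit exponent
    raw = read_msr(path, MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF
    return raw / (1 << esu)
```

A single call only yields the current counter value; an energy measurement is the difference between two such reads, with overflow handling as discussed above.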


Usage. CPU ENERGY METER is intended primarily to be used by benchmarking 
frameworks; however, manual execution is also possible. When the tool is executed, 
it starts the measurements and prints the consumed energy for all supported 
domains and CPUs of the system as soon as it is killed via the interrupt signal 
or Ctrl+C. Intermediate measurements are printed when the signal USR1 is 
received. To manually measure the energy consumption during the execution of a 
specific command, one can execute the following command line, for example: 


cpu-energy-meter & some_command ; kill -INT %1 


1 https://software.intel.com/en-us/articles/intel-power-gadget 
2 https://github.com/sosy-lab/cpu-energy-meter 
3 http://man7.org/linux/man-pages/man4/msr.4.html 
4 http://man7.org/linux/man-pages/man7/capabilities.7.html 
5 http://man7.org/linux/man-pages/man8/setcap.8.html 
6 https://sites.google.com/site/fullycapable/ 
This will measure the energy consumption of all CPUs during the whole time that 
the specified command is running, regardless of whether this energy consumption is 
caused by the specified command or by other processes running in parallel (this is a 
limitation of the RAPL feature). Thus, measuring the energy consumption during a 
specific time period (e.g., 10 s) can be done by replacing some_command with sleep 10. 
The output values are given with the unit Joule, and can be formatted either in 
a way that is optimized for being read by humans (cf. Fig. 1) or parsed by programs. 


+-----------------------------+ 
| CPU Energy Meter   Socket 0 | 
+-----------------------------+ 
Duration      9.990624 sec 
Package      15.898926 Joule 
[...] 
DRAM             [...] Joule 
PSYS        104.778931 Joule 


Fig. 1: Example output of CPU ENERGY METER on a single-CPU 
system of the SkyLake generation 
(with all five domains supported) 
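
The manual workflow above can also be scripted. The following sketch starts the tool, runs a command, sends the interrupt signal, and parses output of the shape shown in Fig. 1. The line format accepted by the parser is an assumption derived from that figure, the function names are hypothetical, and the parser keeps only one value per domain, so it is meant for single-CPU systems.

```python
# Sketch: run cpu-energy-meter around a command and parse Fig. 1 style output.
# Assumptions: result lines look like "<Name> <number> <Joule|sec>" as in
# Fig. 1; only one socket is reported (later sockets would overwrite values).
import re
import signal
import subprocess

RESULT_LINE = re.compile(r"^\s*(\w+)\s+([0-9]+(?:\.[0-9]+)?)\s+(Joule|sec)\s*$")


def parse_output(text: str) -> dict:
    """Map lower-cased row names (e.g. 'package') to their numeric values."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            results[match.group(1).lower()] = float(match.group(2))
    return results


def measure(command: list) -> dict:
    """Measure all CPUs while `command` runs (includes other processes)."""
    meter = subprocess.Popen(
        ["cpu-energy-meter"], stdout=subprocess.PIPE, text=True
    )
    try:
        subprocess.run(command, check=False)
    finally:
        # The tool prints its results when it receives the interrupt signal.
        meter.send_signal(signal.SIGINT)
    output, _ = meter.communicate()
    return parse_output(output)
```

For example, `measure(["sleep", "10"])` would correspond to the fixed-interval measurement described above; dividing the `package` value by the `duration` value gives the average package power in Watt.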


Integration into BENCHEXEC. We have contributed an integration of CPU 
ENERGY METER into the benchmarking framework BENCHEXEC [4], because 
BENCHEXEC is widely used in the formal-methods community (e.g., SV-COMP [2]). 
Starting with version 1.16, BENCHEXEC automatically executes CPU ENERGY 
METER if the latter is installed, and it reports the energy results in the same 
manner as the results of its internal time and memory measurements (BENCHEXEC 
supports the creation of CSV tables and interactive HTML tables with plots for 
its benchmarking results). BENCHEXEC will report the energy consumption only 
if all cores of one or more CPUs are used for each tool execution, because we 
cannot distinguish between the energy consumption of individual processes. 


4 Applications 


The 8th International Competition on Software Verification (SV-COMP’19) [3] 
measured energy consumption of verification tools using BENCHEXEC and CPU 
ENERGY METER and for the first time provided an alternative “green” ranking 
based on energy efficiency (CPU-energy usage divided by achieved score). This 
ranking was indeed considerably different from the main score-based ranking, 
with no overlap between the top three green verifiers and the top three verifiers 
in the category “C-Overall”. Furthermore, the winner in the green ranking is roughly two 
orders of magnitude more efficient than the last tool in the ranking (64 J per 
score point vs. 4200 J per score point). This shows an enormous potential of 
efficiency improvements and energy savings if verification researchers get access 
to easy measurements of energy usage. 
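
The efficiency measure behind such a ranking is a simple quotient, which the following sketch illustrates. The tool names, energy totals, and scores are hypothetical and chosen only so that the quotients reproduce the two values quoted above; they are not SV-COMP data.

```python
# Sketch: rank tools by energy efficiency (CPU energy divided by score).
# All tool names and input numbers below are hypothetical.

def joules_per_score_point(energy_joules: float, score: float) -> float:
    if score <= 0:
        raise ValueError("efficiency is only defined for a positive score")
    return energy_joules / score


def green_ranking(results: dict) -> list:
    """Sort tool names from most to least energy-efficient."""
    return sorted(results, key=lambda name: joules_per_score_point(*results[name]))


if __name__ == "__main__":
    results = {
        "tool_b": (420000.0, 100.0),  # 4200 J per score point
        "tool_a": (64000.0, 1000.0),  # 64 J per score point
    }
    print(green_ranking(results))
    best = joules_per_score_point(*results["tool_a"])
    worst = joules_per_score_point(*results["tool_b"])
    print(worst / best)  # ratio between last and first place
```

With the two quoted values, the ratio between the last and the first tool is 4200 / 64, i.e., about 66.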

In the following, we analyze in more detail some energy measurements of 
SV-COMP’19, which provides all raw results online⁷. We pick the results for 
the submissions CBMC⁸ and CPA-SEQ⁹ across all categories. CPA-SEQ is the 
winner of the category “C-Overall”, written in Java, and employs several different 
algorithms, some of which are partially parallelized. The garbage collector that 
is used by the JVM adds some more parallelism. CBMC is written in C++ and 
uses bounded model checking in a strictly sequential implementation. Thus, we 
expect that the energy consumption of these tools has different characteristics. 
SV-COMP’19 executed both tools for 10522 tasks (CPU-time limit 900 s per 
task, Intel Xeon E3-1230 v5 CPU, quad-core with hyper-threading, 3.4 GHz, all 
8 processing units of the CPU and 15 GB of memory were available to each tool 
execution, Ubuntu 18.04 64-bit with Linux kernel 4.15 was the operating system). 


7 https://sv-comp.sosy-lab.org/2019/results/results-verified/All-Raw.zip 
8 http://www.cprover.org/cbmc/ 
9 https://cpachecker.sosy-lab.org/ 


Table 1: Selection of Energy Measurements from SV-COMP’19 

                         CBMC                           CPA-SEQ 
RAPL domain      Package  PP0 (Core)  DRAM      Package  PP0 (Core)  DRAM 

Average power used per task with regard to wall time (energy divided by wall time): 
Min (W)          1.9      1.2         0.63      4.4      3.4         1.6 
Max (W)          25       24          5.5       36       35          7.2 
Avg (W)          9.7      8.8         2.4       20       19          2.8 
Std. Dev. (W)    3.2      3.2         0.71      6.2      6.2         0.48 

Average power used per task with regard to CPU time (energy divided by CPU time): 
Min (W)          1.8      [...]       0.58      4.2      3.2         0.70 
Max (W)          23       22          5.5       17       16          6.8 
Avg (W)          9.6      8.7         2.4       9.6      9.0         1.5 
Std. Dev. (W)    3.1      3.1         0.74      1.8      1.7         0.60 

We now compare the energy consumption of the RAPL domain “Package” 
with the CPU time for CBMC in Fig. 2 and for CPA-SEQ in Fig. 3.¹⁰ In the plot, 
all results that lie on the same line through the origin belong to tool executions 
for which the energy consumption per second of CPU time (in J/s = W) was the 
same (this would be the average power of the CPU if measuring wall time instead 
of CPU time). We provide additional statistics in Table 1 and two graphs that 
compare the CPU time and the energy consumption of the two tools in Fig. 4. 


Insight: Also for verification tools, high values for CPU time do not imply high 
values for energy. Figure 2 has a large vertical area of data points where the 
CPU time is close to the time limit. For those verification runs, the energy is in 
the range of 2.0 kJ to 15 kJ. This shows that for a specific CPU time, the energy 
consumption (and average power, cf. Table 1) for different verification tasks can 
vary by a factor of 7. 
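
As a worked check of this factor, assume for simplicity that the CPU time of these runs is exactly the limit of 900 s (the text only says that it is close to the limit). The average power with regard to CPU time at the two ends of the energy range is then:

```latex
\[
P_{\mathrm{low}} \approx \frac{2.0\,\mathrm{kJ}}{900\,\mathrm{s}} \approx 2.2\,\mathrm{W},
\qquad
P_{\mathrm{high}} \approx \frac{15\,\mathrm{kJ}}{900\,\mathrm{s}} \approx 16.7\,\mathrm{W},
\qquad
\frac{P_{\mathrm{high}}}{P_{\mathrm{low}}} = \frac{15}{2.0} = 7.5
\]
```

Both power values lie within the range between the minimum and maximum reported for CBMC in the lower part of Table 1.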


Insight: Comparing different verification tools regarding CPU time can lead to 
different conclusions than energy-based comparisons. The graph on the left of 
Fig. 4 compares CBMC and CPA-SEQ regarding CPU time, the graph on the 
right compares them regarding energy consumption. The difference between the 
shapes of these two graphs shows that looking at the energy consumption when 
comparing tools is an interesting addition to comparing only CPU time, and that 
the similar statistics on power usage with regard to CPU time (cf. lower part of 
Table 1 and Figs. 2 and 3) can be misleading: if the power-usage characteristics 
of both tools were the same, the two graphs in Fig. 4 would look similar. 


10 For CPA-SEQ, the CPU time is sometimes higher than 900 s because SV-COMP lets 
tools optionally run for more than the time limit in order to print additional statistics 
(but any result after the time limit is of course discarded). 


5 Conclusion 


Verification algorithms consume large amounts of energy and thus, it is prohibitive 
to ignore the energy characteristics of algorithms when comparing their quality. 
Although this matter is understood, the verification community does not measure 
energy. We believe that this is because measurement of energy is complex and 
requires a lot of additional effort. The lightweight tool CPU ENERGY METER 
fills this gap: It supports reading Intel-RAPL-based energy measurements in a 
convenient way and, via integration into BENCHEXEC, using a tool environment 
that many verification researchers use anyway already. 

An analysis of a large data set from a verification competition invalidates a 
wide-spread assumption: the data quickly reveal that energy consumption can 
deviate significantly from the consumed CPU time. Thus, it is not sufficient to 
measure CPU time. 
Data Availability Statement. A replication package for this article including 
CPU ENERGY METER and BENCHEXEC is available at Zenodo [5]. Current versions 
of CPU ENERGY METER are available at https://github.com/sosy-lab/cpu-energy-meter 
and https://doi.org/10.5281/zenodo.1300309. The dataset 
from SV-COMP’19 [3] that was analyzed in Sect. 4 is available online at 
https://sv-comp.sosy-lab.org/2019/results/results-verified/All-Raw.zip. 


References 


1. Bekas, C., Curioni, A.: A new energy aware performance metric. Computer Science 
- R&D 25(3-4), 187-195 (2010). https://doi.org/10.1007/s00450-010-0119-z 

2. Beyer, D.: Reliable and reproducible competition results with BENCHEXEC and 
witnesses (Report on SV-COMP 2016). In: Proc. TACAS. pp. 887-904. LNCS 9636, 
Springer (2016). https://doi.org/10.1007/978-3-662-49674-9_55 

3. Beyer, D.: Automatic verification of C and Java programs: SV-COMP 
2019. In: Proc. TACAS (3). pp. 133-155. LNCS 11429, Springer (2019). 
https://doi.org/10.1007/978-3-030-17502-3_9 

4. Beyer, D., Löwe, S., Wendler, P.: Reliable benchmarking: Requirements 
and solutions. Int. J. Softw. Tools Technol. Transfer 21(1), 1-29 (2019). 
https://doi.org/10.1007/s10009-017-0469-y 

5. Beyer, D., Wendler, P.: Replication package for article ‘CPU Energy Meter: A 
tool for energy-aware algorithms engineering’ in Proc. TACAS ’20. Zenodo (2020). 
https://doi.org/10.5281/zenodo.3679337 

6. Desrochers, S., Paradis, C., Weaver, V.M.: A validation of DRAM RAPL power 
measurements. In: Proc. Int. Symposium on Memory Systems (MEMSYS). pp. 
455-470. ACM (2016). https://doi.org/10.1145/2989081.2989088 

7. Dongarra, J.J., Ltaief, H., Luszczek, P., Weaver, V.M.: Energy footprint of advanced 
dense numerical linear algebra using tile algorithms on multicore architectures. In: 
Proc. Int. Conference on Cloud and Green Computing (CGC). pp. 274-281. IEEE 
(2012). https://doi.org/10.1109/CGC.2012.113 

8. Ge, R., Feng, X., Song, S., Chang, H., Li, D., Cameron, K.W.: PowerPack: 
Energy profiling and analysis of high-performance systems and 
applications. IEEE Trans. Parallel Distrib. Syst. 21(5), 658-671 (2010). 
https://doi.org/10.1109/TPDS.2009.76 

9. Hackenberg, D., Ilsche, T., Schöne, R., Molka, D., Schmidt, M., Nagel, W.E.: Power 
measurement techniques on standard compute nodes: A quantitative comparison. 
In: Proc. Int. Symposium on Performance Analysis of Systems & Software (ISPASS). 
pp. 194-204. IEEE (2013). https://doi.org/10.1109/ISPASS.2013.6557170 

10. Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An 
energy efficiency feature survey of the Intel Haswell processor. In: Proc. Int. Parallel 
and Distributed Processing Symposium (IPDPS). pp. 896-904. IEEE (2015). 
https://doi.org/10.1109/IPDPSW.2015.70 

11. Hähnel, M., Döbel, B., Völp, M., Härtig, H.: Measuring energy consumption for 
short code paths using RAPL. SIGMETRICS Performance Evaluation Review 
40(3), 13-17 (2012). https://doi.org/10.1145/2425248.2425252 

12. Hindle, A.: Green software engineering: The curse of methodology. Tech. Rep. 
4:e1470v2, PeerJ PrePrints (2016). https://doi.org/10.7287/peerj.preprints.1470v2 

13. Ilsche, T., Hackenberg, D., Graul, S., Schöne, R., Schuchart, J.: Power measurements 
for compute nodes: Improving sampling rates, granularity and accuracy. In: Proc. 
Int. Green and Sustainable Computing Conference (IGSC). pp. 1-8. IEEE (2015). 
https://doi.org/10.1109/IGCC.2015.7393710 

14. Intel: Intel 64 and IA-32 architectures software developer's manual, vol. 3B, 
chap. 14.9 (December 2017), available at 
https://software.intel.com/en-us/articles/intel-sdm 

15. Khan, K.N., Ou, Z., Hirki, M., Nurminen, J.K., Niemi, T.: How much power does 
your server consume? Estimating wall socket power using RAPL measurements. 
Computer Science - R&D 31(4), 207-214 (2016). 
https://doi.org/10.1007/s00450-016-0325-4 

16. Scaramella, J., Eastwood, M.: Solutions for the data center's thermal challenges. 
Tech. rep., IDC (2007), available at 
https://www-935.ibm.com/services/fr/igs/pdf/idc_opinion_coolblue_wp.pdf 

17. Schuchart, J., Hackenberg, D., Schöne, R., Ilsche, T., Nagappan, R., Patterson, 
M.K.: The shift from processor power consumption to performance variations: 
Fundamental implications at scale. Computer Science - R&D 31(4), 197-205 (2016). 
https://doi.org/10.1007/s00450-016-0327-2 

18. Venkatesh, A., Kandalla, K.C., Panda, D.K.: Evaluation of energy characteristics 
of MPI communication primitives with RAPL. In: Proc. Int. Symposium 
on Parallel & Distributed Processing (IPDPSW). pp. 938-945. IEEE (2013). 
https://doi.org/10.1109/IPDPSW.2013.243 

19. Weaver, V.M., Johnson, M., Kasichayanula, K., Ralph, J., Luszczek, P., Terpstra, 
D., Moore, S.: Measuring energy and power with PAPI. In: Proc. Int. Conference 
on Parallel Processing Workshops (ICPPW). pp. 262-268. IEEE (2012). 
https://doi.org/10.1109/ICPPW.2012.39 


Open Access. This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution, and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the source, 
provide a link to the Creative Commons license and indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter's Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will need 
to obtain permission directly from the copyright holder. 


Logic and Proof 



Practical Machine-Checked Formalization of 
Change Impact Analysis 


Karl Palmskog¹, Ahmet Celik², and Milos Gligoric³ 


¹ KTH Royal Institute of Technology, Stockholm, Sweden 
² Facebook, Seattle, WA, USA 
³ The University of Texas at Austin, Austin, TX, USA 
palmskog@kth.se, celik@fb.com, gligoric@utexas.edu 


Abstract. Change impact analysis techniques determine the compo- 
nents affected by a change to a software system, and are used as part 
of many program analysis techniques and tools, e.g., in regression test 
selection, build systems, and compilers. The correctness of such analyses 
usually depends both on domain-specific properties and change impact 
analysis, and is rarely established formally, which is detrimental to trust- 
worthiness. We present a formalization of change impact analysis with 
machine-checked proofs of correctness in the Coq proof assistant. Our 
formal model factors out domain-specific concerns and captures system 
components and their interrelations in terms of dependency graphs. Us- 
ing compositionality, we also capture hierarchical impact analysis for- 
mally for the first time, which, e.g., can capture when impacted files are 
used to locate impacted tests inside those files. We refined our verified im- 
pact analysis for performance, extracted it to efficient executable OCaml 
code, and integrated it with a regression test selection tool, one regres- 
sion proof selection tool, and one build system, replacing their existing 
impact analyses. We then evaluated the resulting toolchains on several 
open source projects, and our results show that the toolchains run with 
only small differences compared to the original running time. We believe 
our formalization can provide a basis for formally proving domain-specific 
techniques using change impact analysis correct, and our verified code 
can be integrated with additional tools to increase their reliability. 


Keywords: Change impact analysis - Regression test selection - Coq. 


1 Introduction 


Change impact analysis aims to determine the components affected by a change 
to a software system, e.g., the modules or files affected by a modified line of 
code [3,4]. Change impact analysis techniques are used in many program analyses 
and tools, such as regression test selection (RTS) tools [26, 52, 59, 61], build 
systems [15,21,43,45], and incremental compilers [48]. 

Change impact analysis techniques typically mix domain- and language- 
specific concepts, such as method call graphs and class files, with more abstract 
notions, such as dependencies, transitive closures, and topological sorts. This can 
complicate reasoning about the correctness (safety) of a technique. For example, 
to the best of our knowledge, RTS techniques for Java-like languages have 
never been argued to be safe (i.e., to never omit tests affected by a change) by 
machine-checked reasoning—only by high-level pen-and-paper proofs [51,55,60]. 


In this paper, we present a formalization of key concepts used in many change 
impact analysis techniques—concepts that are independent of any language or 
application domain. Our formalization represents system components and their 
interrelations as vertices and edges in explicit dependency graphs. We consider 
whether components are impacted by changes between two system revisions by 
computing transitive closures of modified graph vertices in the inverse of the 
dependency graph from the old revision. This has been described as “invalidating 
the upward transitive closure” [14]. Among impacted vertices, we identify those 
that are checkable, representing, e.g., a test method, that can be re-executed. 
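
As an informal illustration of this computation (not the Coq formalization or the extracted OCaml code), the following sketch computes the vertices reachable from the modified ones in the inverse of a dependency graph and then filters the checkable ones. The dictionary encoding of the graph and all vertex names are assumptions made for the example.

```python
# Sketch: impacted vertices as the transitive closure of the modified
# vertices in the inverse dependency graph. Graph encoding is hypothetical:
# graph[v] is the set of vertices that v directly depends on.
from collections import deque


def invert(graph: dict) -> dict:
    """Return the inverse graph: inverse[d] holds all v with d in graph[v]."""
    inverse = {v: set() for v in graph}
    for v, dependencies in graph.items():
        for d in dependencies:
            inverse.setdefault(d, set()).add(v)
    return inverse


def impacted(graph: dict, modified: set) -> set:
    """Modified vertices plus everything that transitively depends on them."""
    inverse = invert(graph)
    seen = set(modified)
    queue = deque(modified)
    while queue:
        v = queue.popleft()
        for dependent in inverse.get(v, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    old_graph = {
        "TestA": {"A"}, "A": {"B"}, "B": set(),
        "TestC": {"C"}, "C": set(),
    }
    checkable = {"TestA", "TestC"}
    result = impacted(old_graph, {"B"})
    print(sorted(result))              # impacted vertices
    print(sorted(result & checkable))  # impacted vertices to re-execute
```

In this example, only `TestA` needs to be re-executed after a change to `B`, because `TestC` does not depend on `B`, directly or indirectly.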


We encoded our formal model as a library in the Coq proof assistant, and 
proved two key correctness properties: soundness and completeness. Soundness, 
intuitively, states that the outcomes of executing checkable vertices that are 
unimpacted in the new revision are the same as they would be in the previ- 
ous revision. Completeness roughly states that all checkable vertices in the new 
revision are members of the set of all added, impacted, and unimpacted vertices. 


Based on our correctness approach, we also defined and proved correct two 
strategies for hierarchical change impact analysis that are roughly analogous 
to, on the one hand, file-based incremental builds [43, 54], and on the other 
hand, hybrid regression test selection [46,60]. To the best of our knowledge, 
hierarchical change impact analysis is previously unexplored in formal settings 
like ours. Ultimately, by proving some basic properties about relations between 
vertices and results of executing checkable vertices, developers can use our model 
and library to obtain end-to-end guarantees for domain-specific impact analyses. 


To capture our model of system components and their dependencies in Coq, 
we used the Mathematical Components (MC) library [42] and its representation 
of relations, finite graphs, and subtypes [25,28,29]. For the formal proofs, we used 
the SSReflect proof language and followed the idiom of the MC library of leveraging 
boolean decision procedures in proofs via small-scale reflection [9, 30, 31]. 
To obtain efficient executable code, we performed several verified refinements of 
our initial Coq encoding. From our refined functions and datatypes, we then de- 
rived a practical tool, dubbed CHIP, by carefully extracting Coq code to OCaml 
and linking it with an assortment of OCaml libraries. CHIP can be viewed as a 
verified component for change impact analysis that can either be integrated into 
verified systems or used in conventionally developed systems. 


To ensure the adequacy of our formal model, we performed an empirical 
study using CHIP. Specifically, we integrated CHIP with EKSTAZI [26], a tool for 
class-based regression test selection in Java, with iCoq [11], a tool for regression 
proof selection in Coq itself, and with Tup [54], a build system similar to make, 
replacing the existing components for change impact analysis in all these tools. 
We then compared the outcome and running time between the respective mod- 
ified and original tool versions when applied to the revision histories of several 
open-source projects. This approach is along the lines of previous evaluations of 
formal specifications [8, 20,33] and RTS techniques [26,37,60]. During our eval- 
uation of CHIP, we also located and addressed several performance bottlenecks. 


We make the following contributions: 

— Basic formal model: We present a formalization of change impact analysis 
in terms of finite graphs and sets, encoded in the Coq proof assistant via the 
MC library. We formulated and proved in Coq key correctness requirements 
for our analysis, namely, soundness and completeness. 

— Hierarchical formal model: We extended our model to capture two strate- 
gies for hierarchical change impact analysis, where higher-level components 
are implicitly tied to lower-level components, and proved them both correct. 

— Library: Our Coq development forms a library of definitions and lemmas 
that can assist in formally proving various techniques based on change impact 
analysis, such as regression test selection for Java, correct inside Coq. 

— Optimizations: We refined our verified Coq functions and data structures 
to significantly improve performance in practice of code extracted to OCaml. 

— Tool: From our refined Coq code, we derived a verified executable tool in 
OCaml for change impact analysis, CHIP, by carefully leveraging Coq’s code 
extraction mechanism. CHIP can be used as a verified component for both 
regular and hierarchical change impact analysis in other tools. The CHIP code, 
compatible with Coq 8.9, MC 1.7, and OCaml 4.07, is publicly available [47]. 

— Evaluation: We integrated CHIP with a tool for regression test selection 
in Java projects, EKSTAZI, one regression proof selection tool for Coq itself, 
iCoq, as well as one build system, TUP, and evaluated the resulting toolchains 
on several medium to large-scale open source projects. 


2 Background 


In this section, we give some brief background on change impact analysis and 
its applications, and on the Coq proof assistant. 


2.1 Change Impact Analysis 


Broadly, we consider change impact analysis as the activity of identifying the 
potential consequences of a change to a software system. Formulated in this way, 
change impact analysis is an old concern in software engineering [4], and remains 
an active research topic as part of techniques and tools [1,34,53]. In early work, 
Arnold posited computing transitive closures of statically derived program call 
graphs as the fundamental technique for change impact analysis [3]. However, 
later research argues that dynamic analysis can be more precise [36] and lead to 
faster dependency collection for use in future analyses [26]. Our work aims to 
capture general concepts used in both static and dynamic approaches [10, 38]. 


2.2 Regression Test Selection and Regression Proof Selection 


Regression test selection (RTS) techniques optimize regression testing — running 
tests at each project revision to check correctness of recent changes — by dese- 
lecting tests that are not affected by the recent changes [50,59]. Traditionally, 
RTS techniques maintain for each test a set of code elements (e.g., statements, 
methods, classes) on which the test depends. When code elements are modified, 
change impact analysis is used to detect those tests that are potentially affected 
by the changes. Prior work has studied RTS for various programming languages 
(e.g., C, C++, and Java), built dependency graphs statically or dynamically, 
and used various granularities of code elements (e.g., statements, methods, and 
classes). The meaning of the dependency graph is language-specific, but if the 
graph is properly constructed, the change impact analysis is independent of the 
language. For example, EKSTAZI [26], a recent RTS tool for Java projects, builds 
and maintains Java class file dependency graphs dynamically, and when a class 
file is modified, EKSTAZI uses change impact analysis to select all test classes 
that depend, directly or indirectly, on the modified class. 

Regression proof selection (RPS) is the analogue of RTS for formal proofs, 
which, similarly to tests, can take a long time to check. The RPS technique 
implemented in the iCoq tool for Coq [12] uses hierarchical selection [11], where 
impacted files are used to locate impacted proofs to be checked. 


2.3 Build Systems 


The classic build system make uses file timestamp comparisons to decide whether 
a task defined in a build script should be run. Dependency graphs are implic- 
itly defined by tasks depending on the completion of other tasks, or on certain 
files, as expressed in the build script. In contrast to test execution, build script 
task execution typically produces side effects in the form of new files, e.g., files 
with object code in ELF format. Modern build systems such as Bazel [5] and 
CloudMake [21,27] can use other ways than timestamps to find modified files, 
e.g., comparing cryptographic hashes of files across revisions. Recent alternative 
build systems that aim to replace make include TUP [54] and Shake [43]; the 
former uses an explicit persisted dependency graph. 


2.4 The Coq Proof Assistant and Mathematical Components 


Coq consists of, on the one hand, a small and powerful purely functional pro- 
gramming language, and on the other hand, a system for specifying properties 
about programs and proving them [6]. Coq is based on a constructive type the- 
ory [17,18] which effectively reduces proof checking to type checking, and puts 
programming on the same footing as proving. Mathematical Components (MC) [42] 
is an extensive Coq library that provides many structures from mathematics, in- 
cluding finite sets, relations, and subtypes; we use the module fingraph, which 
was derived from Gonthier's proof of the four-color theorem [28]. 

Datatypes and functions verified inside Coq to have some correctness prop- 
erty can be extracted to a practical programming language such as OCaml [40], 
and then integrated with libraries; extraction is used in several large-scale soft- 
ware verification projects [39, 57]. Obtaining efficient programs via extraction 
may require significant engineering because of discord between the requirements 
for formal correctness and agreeable program runtime behavior [19]. When target 
languages lack fully formal semantics, as is the case for OCaml, extraction cannot 
be fully trusted, but empirical evaluations are nevertheless encouraging [24,58]. 


3 Formal Model


This section introduces our model, assumptions, and correctness approach. 


3.1 Definitions 

Components: Our model of change impact analysis uses two finite sets of
vertices V and V', where V ⊆ V'. Members of these sets represent the components
of a system (e.g., files or classes) before and after a change, respectively.

Artifacts: We let A be a set of artifacts. An artifact is intended to be a concrete
underlying representation of a component, e.g., an abstract syntax tree or the
content of a file. We assume that the equality of two artifacts is decidable, i.e.,
that we can compute for all a, a' ∈ A whether a = a' or a ≠ a'. To associate
vertices with artifacts, we use two total functions f : V → A and f' : V' → A.
In practice, we expect these functions to map vertices to compact summaries of
component representations, such as checksums computed by cryptographic hash
functions. Whenever f(v) ≠ f'(v) for some v ∈ V, we say that the artifact for v
is modified after the revision; otherwise, it is unmodified.

Graphs: Let g be a binary relation on V. For v, v' ∈ V, we say that v directly
depends on v' if g(v, v') holds. For example, if v and v' represent classes in a
Java-like language, v may be a subclass of v'. We will usually refer to relations
like g as (dependency) graphs. We write g⁻¹ for the inverse of g, i.e., we have
g⁻¹(v, v') iff g(v', v). Moreover, we write g*(v, v') for when v and v' are
transitively related in g, and say that v transitively depends on v'. We define the
reflexive-transitive closure of a vertex v ∈ V with respect to a graph g as the set
{v' | g*(v, v')}, i.e., as the set of all vertices reachable from v in g (which
includes v itself).

Execution: We assume there is a subset E ⊆ V' of checkable vertices, i.e., it is
meaningful to apply some (side-effect free) function check on them and obtain 
some result. For example, a checkable vertex may represent a test method that 
either passes or fails when executed. 

Impactedness: Let g be a dependency graph. We then say that a vertex v ∈ V
is impacted if it is reachable in g⁻¹ from some modified vertex. Equivalently,
v is impacted iff there is a v' ∈ V such that f(v') ≠ f'(v') and (g⁻¹)*(v', v).
Additionally, a vertex v'' ∈ V' is considered fresh whenever v'' ∉ V.

We take the (disjoint) union of the set I of impacted vertices and the set F
of fresh vertices, and consider the checkable vertices in this set, i.e., vertices in
(I ∪ F) ∩ E. Intuitively, these are the only vertices that we need to consider in
the new revision, since all other vertices in V' are unimpacted—and using check 
on unimpacted vertices will have the same outcome as in the old revision. 


3.2 Example 

Figure 1 illustrates the core idea of the graph-based change impact analysis ap- 
proach we model. Figure 1(a) shows the original dependency graph, where, e.g., 
component 3 depends directly on components 1 and 2, and 5 depends directly 
on 3 and transitively on 1 and 2; dotted components are checkable. Figure 1(b) 
shows the inverse graph, with the modified component 1 bolded, and the com- 
ponents impacted by the change in gray (the reflexive-transitive closure of 1 in 
the inverse graph). Based on these results, we call check on 5, but not on 6. 

Fig. 1. Dependency graph where component 1 is changed, impacting 3 and 5.
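
The example can be replayed with a few lines of OCaml. This is only a sketch: the edges for components 3 and 5 follow the description above, while the edges for components 4 and 6 and the choice of 5 and 6 as the checkable components are assumptions made for illustration, since the figure itself is not reproduced here.

```ocaml
(* Dependency graph as an association list: vertex -> direct dependencies.
   The entries for 4 and 6 are assumed, not taken from the figure. *)
let deps = [ (1, []); (2, []); (3, [ 1; 2 ]); (4, [ 2 ]); (5, [ 3 ]); (6, [ 4 ]) ]

(* Inverse graph: the vertices that directly depend on v. *)
let dependents graph v =
  List.filter_map (fun (w, ds) -> if List.mem v ds then Some w else None) graph

(* Union of the reflexive-transitive closures of the modified vertices
   in the inverse graph, returned in sorted order. *)
let impacted graph modified =
  let rec visit seen v =
    if List.mem v seen then seen
    else List.fold_left visit (v :: seen) (dependents graph v)
  in
  List.sort compare (List.fold_left visit [] modified)

(* Assumed checkable (dotted) components. *)
let checkable = [ 5; 6 ]

(* Checkable vertices that must be rechecked when component 1 is modified. *)
let to_check = List.filter (fun v -> List.mem v checkable) (impacted deps [ 1 ])

let () =
  List.iter (Printf.printf "impacted: %d\n") (impacted deps [ 1 ]);
  List.iter (Printf.printf "check: %d\n") to_check
```

Under these assumptions the impacted set is {1, 3, 5}, so only component 5 is rechecked.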


3.3 Correctness Approach 


For correctness, we intuitively show that executing only impacted and fresh 
vertices that are checkable is enough in the new revision, since the result of 
executing unimpacted vertices is the same as in the old revision. This means 
that if we have access to the results of checking vertices in the old revision, we 
can use those results to obtain the complete outcome for all checkable vertices 
in the new revision, without going through the work usually required. 

Having constructed the set T of tuples of checkable vertices and outcomes 
from the impacted, fresh, and unimpacted vertices, we can ask (1) whether T 
is complete, i.e., whether it contains outcomes for all checkable vertices in V’, 
and (2) whether the outcomes in T are sound, i.e., whether they are the same as if we had
explicitly called check on the associated vertices. 

To be able to prove soundness and completeness, we need to assume several 
properties relating the dependency graphs and outcomes of executing vertices in 
both revisions. Informally, we make the following assumptions: 


A1: The direct dependencies of a vertex v are the same in both revisions if the
artifact of v is the same in both revisions, i.e., if f(v) = f'(v). 

A2: A vertex v with the same artifact in both revisions is checkable in the new 
revision iff v is checkable in the old revision. 


A3: The outcome of executing a checkable vertex v is the same in both revisions 
if the sets of vertices v depends on transitively are the same, and the artifact 
of each dependency is the same. 


The last assumption implicitly rules out that the underlying operation (e.g., test 
execution) on a vertex is nondeterministic, which it can be in practice [41]. 
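
As a rough illustration of how T is assembled and what soundness and completeness mean for it, consider the following OCaml sketch. All data is hypothetical: vertices are integers, outcomes are booleans, and the set of impacted and fresh vertices is simply given rather than computed.

```ocaml
(* Hypothetical toy instance; none of these values come from the paper. *)
let old_results = [ (5, true); (6, true) ]     (* outcomes from the old revision *)
let impacted_or_fresh = [ 1; 3; 5; 7 ]          (* assumed result of the analysis *)
let checkable' v = List.mem v [ 5; 6; 7 ]       (* checkable in the new revision *)
let check' v = v <> 5                           (* new-revision check: 5 now fails *)

(* T = fresh outcomes for impacted and fresh checkable vertices,
   followed by reused old outcomes for unimpacted vertices. *)
let t =
  let rerun = List.filter checkable' impacted_or_fresh in
  List.map (fun v -> (v, check' v)) rerun
  @ List.filter (fun (v, _) -> not (List.mem v impacted_or_fresh)) old_results

(* Completeness: every checkable vertex of the new revision has an outcome in T. *)
let complete all_vertices =
  List.for_all (fun v -> (not (checkable' v)) || List.mem_assoc v t) all_vertices

(* Soundness: every outcome in T agrees with calling check' directly. *)
let sound = List.for_all (fun (v, r) -> checkable' v && check' v = r) t

let () =
  Printf.printf "complete = %b, sound = %b\n" (complete [ 1; 2; 3; 4; 5; 6; 7 ]) sound
```

The reused outcome for vertex 6 is sound here only because the toy check' happens to agree with the old outcome; in the model, this agreement is exactly what assumption A3 guarantees for unimpacted vertices.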


4 Model Encoding 


In this section, we give an overview of our encoding in Coq of the formal model 
described in the previous section, using theories of finite sets and graphs from the 
MC library. We use a simplified version of Coq’s specification language, Gallina. 


4.1 Encoding in Coq 


We represent the vertex set V’ as a finite type (finType) V’, and its subset V as 
a subtype (subType) V, induced by a decidable predicate P on vertices in V’ (of 
type pred V’). This allows us to define the graph g as a binary decidable relation 
g on V, i.e., a variable of type rel V, and use the MC library predicate connect 
to express whether two vertices are transitively related in g. The inverse of g is
defined as [rel x y | g y x], which we write as g⁻¹. We use connect to form the
set of vertices in the reflexive-transitive closure of a given vertex x with respect 
to a graph g, and a canonical big operator [7] to form the union of all such 
closures for elements in a given set m of modified vertices: 


Def impacted (g : rel V) (m : {set V}) : {set V} :=
  \bigcup_(x | x \in m) [set y | connect g x y].


We characterize this function through MC's reflect ("if and only if"):

Thm impactedP g m x : reflect (∃ v, v \in m & connect g v x) (x \in impacted g m).


The MC library function val injects a subtype element into the corresponding 
supertype. We use this to capture impacted and fresh vertices in V': 


Def impacted_V' m : {set V'} := [set (val v) | v in impacted g⁻¹ m].
Def fresh_V' : {set V'} := [set v | ~~ P v].


We represent the set of artifacts A as a type A with decidable equality (eqType), 
and functions f and f’ as regular Coq functions f and f'. This allows us to 
define the set of modified vertices in V', and then take the union (operator :|:) 
of impacted and fresh vertices: 


Def mod_V : {set V} := [set v | f v != f' (val v)].
Def impacted_fresh_V' : {set V'} := impacted_V' mod_V :|: fresh_V'.


We then use a predicate checkable to form the subset of vertices in V' that can 
be executed: 


Def chk_impacted_fresh_V' : {set V'} := [set v in impacted_fresh_V' | checkable v].


We use a function check, which takes a vertex and returns a term in a result 
type R (an eqType, e.g., bool), to define a sequence of vertices and results: 


Def res_impacted_fresh_V' : seq (V' * R) :=
  [seq (v, check v) | v <- enum chk_impacted_fresh_V'].


Note that by using a sequence instead of a finite set for these tuples, we ensure 
R can be any type with decidable equality, such as a message of arbitrary length. 
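
For readers less familiar with MC notation, the definitions above can be mirrored with plain OCaml lists. This is a hand-written sketch, not the extracted code: vertices are integers, old vertices are 0 .. num_old-1 and new vertices are 0 .. num_new-1 (so the predicate P becomes v < num_old), and the function names and the argument g_inv (the inverse graph as a successor function) are assumptions of this example.

```ocaml
let range n = List.init n (fun i -> i)

(* mod_V: old vertices whose artifact differs between the revisions. *)
let mod_v ~num_old f f' = List.filter (fun v -> f v <> f' v) (range num_old)

(* impacted: union of the closures of the vertices in m in the inverse graph. *)
let impacted g_inv m =
  let rec visit seen v =
    if List.mem v seen then seen
    else List.fold_left visit (v :: seen) (g_inv v)
  in
  List.fold_left visit [] m

(* fresh_V': new vertices that are not old vertices. *)
let fresh ~num_old ~num_new = List.filter (fun v -> v >= num_old) (range num_new)

(* impacted_fresh_V': union of impacted and fresh vertices, sorted. *)
let impacted_fresh ~num_old ~num_new f f' g_inv =
  List.sort_uniq compare
    (impacted g_inv (mod_v ~num_old f f') @ fresh ~num_old ~num_new)

(* res_impacted_fresh_V': pair each checkable selected vertex with its result. *)
let res_impacted_fresh ~num_old ~num_new f f' g_inv checkable check =
  impacted_fresh ~num_old ~num_new f f' g_inv
  |> List.filter checkable
  |> List.map (fun v -> (v, check v))

let () =
  let f = function 0 -> "a" | 1 -> "b" | _ -> "c" in
  let f' = function 0 -> "a2" | 1 -> "b" | 2 -> "c" | _ -> "d" in
  let g_inv = function 0 -> [ 2 ] | _ -> [] in
  res_impacted_fresh ~num_old:3 ~num_new:4 f f' g_inv
    (fun v -> v >= 2) (fun v -> v mod 2 = 0)
  |> List.iter (fun (v, r) -> Printf.printf "vertex %d: %b\n" v r)
```

In this toy run, vertex 0 is modified, vertex 2 depends on it, and vertex 3 is fresh, so the checkable vertices 2 and 3 are the ones paired with results.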


4.2  Correctness Statements 


For stating and proving correctness, we assume we have dependency graphs for 
the old and new revision, as well as definitions of whether vertices are checkable, 
and checking functions: 


Vars (g : rel V) (g' : rel V').
Vars (checkable : pred V) (checkable' : pred V') (check : V → R) (check' : V' → R).


We then define the graph g for vertices in V', named g_V':


Def insub_g (x y : V') : bool := match insub x, insub y with
  | Some x', Some y' => g x' y' | _, _ => false end.
Def g_V' : rel V' := [rel x y | insub_g x y].

This allows us to formulate the assumption A1 from above:

Hyp fg_eq : ∀ (v : V), f v = f' (val v) → ∀ (v' : V'), g_V' (val v) v' = g' (val v) v'.

The assumption A2 is equally straightforward to define:

Hyp chk_f : ∀ v, f v = f' (val v) → checkable v = checkable' (val v).


Finally, the assumption A3, when formalized, establishes a relation between 
vertices in g and g’: 


Hyp chk_V : ∀ (v : V), checkable v → checkable' (val v) →
  (∀ (v' : V'), connect g_V' (val v) v' = connect g' (val v) v') →
  (∀ (v' : V), connect g_V' (val v) (val v') → f v' = f' (val v')) →
  check v = check' (val v).


We now assume we are given a sequence of results for checkable vertices in the 
old revision, and that this sequence is sound, complete, and duplicate-free: 


Var res_V : seq (V * R).
Hyp res_VP : ∀ v r, reflect (checkable v ∧ check v = r) ((v, r) \in res_V).
Hyp res_V_uniq : uniq [seq vr.1 | vr <- res_V].


We can then filter the sequence of old results to locate unimpacted vertices in 
the new revision: 


Def res_unimpacted_V' : seq (V' * R) := [seq (val vr.1, vr.2) |
  vr <- res_V & val vr.1 \notin impacted_V' mod_V].


This allows us to form a final sequence of vertex-result pairs:

Def res_V' : seq (V' * R) := res_impacted_fresh_V' ++ res_unimpacted_V'.

For sanity-checking, we prove the absence of duplicates:


Def chk_V' : seq V' := [seq vr.1 | vr <- res_V'].
Thm chk_V'_uniq : uniq chk_V'.


We prove that the sequence contains all checkable vertices in V' (completeness):

Thm chk_V'_complete (v : V') : checkable' v → v \in chk_V'.


Finally, we prove that the results in the sequence are consistent with explicitly 
calling check’ on all vertices in V’ (soundness): 


Thm chk_V'_sound (v : V') (r : R) : (v, r) \in res_V' → checkable' v ∧ check' v = r.


The formal proofs, which we elide here, mostly reduce to reasoning over the
connect predicate and inductively on graph paths. 


5 Component Hierarchies 


Let V be a set of vertices representing fine-grained components (e.g., methods),
with dependency graph g⊥. Let U be a different set of vertices representing
coarse-grained components (e.g., files), associated with a function p : U → 2^V
that defines a partition of V. The partition indicates how components in U
encapsulate components in V, and is associated with a graph g⊤ of vertices in U
that is consistent with dependencies expressed in g⊥. This approach can be
repeated to produce component hierarchies, each time coalescing sets of
finer-grained dependencies into single coarser-grained dependencies. Figure 2
illustrates a two-level hierarchy and its component dependencies.

Fig. 2. Hierarchy with component sets U and V, partition p, and dependencies.

Some change impact analysis techniques consider both fine-grained and coarse- 
grained component levels [11, 46,60]. A key idea behind these techniques is to 
exploit the relationships between vertices across granularity levels. In particular, 
if a vertex u ∈ U is unmodified after a change, we may be able to immediately
conclude that all vertices v ∈ p(u) are unmodified as well, potentially ruling out
that a large subset of V is impacted. In this section, we formalize this intuition 
using our existing notions to express hierarchical change impact analysis. 


5.1 Formal Model of Hierarchies 


Let f⊥ and f'⊥ be the functions mapping vertices to artifacts for V and V' with
V ⊆ V', and let f⊤ and f'⊤ be the corresponding functions for U and U' with
U ⊆ U'. Let p and p' be partition-inducing functions from U and U' to subsets
of V and V', respectively. We make the following assumptions:


H1: For all u, u' ∈ U and v, v' ∈ V, if u ≠ u', g⊥(v, v'), v ∈ p(u), and v' ∈ p(u'),
then g⊤(u, u').

H2: For all u ∈ U, if f⊤(u) = f'⊤(u), then p(u) = p'(u).

H3: For all u ∈ U and v ∈ V, if f⊤(u) = f'⊤(u) and v ∈ p(u), then f⊥(v) = f'⊥(v).


Intuitively, H1 expresses that whenever two fine-grained components that reside 
in different coarse-grained components are related, there must be a correspond- 
ing relation between their respective coarse-grained components. H2 expresses 
that whenever a coarse-grained component is unchanged, it contains the same 
fine-grained components as before. Finally, H3 expresses that a fine-grained com- 
ponent is unchanged if the coarse-grained component that contains it is un- 
changed. Under these assumptions, there are essentially two distinct strategies 
we can use to leverage impact analysis for coarse-grained components to analyze 
fine-grained components. 
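
To make H1 concrete, the sketch below checks it on a small two-level system. The file and method names, the graphs, and the helper functions are all hypothetical; the check simply restates H1 over edge lists.

```ocaml
(* Hypothetical partition: each file (in U) contains some methods (in V). *)
let files = [ "A.java"; "B.java" ]
let p = function "A.java" -> [ "a1"; "a2" ] | "B.java" -> [ "b1" ] | _ -> []

(* Edge (x, y) means that x directly depends on y. *)
let g_bot = [ ("b1", "a1"); ("a2", "a1") ]   (* method-level dependencies *)
let g_top = [ ("B.java", "A.java") ]          (* file-level dependencies *)

(* The file containing method v; unique because p induces a partition.
   Raises Not_found if v belongs to no file. *)
let owner v = List.find (fun u -> List.mem v (p u)) files

(* H1: every method dependency that crosses files has a matching file dependency. *)
let h1_holds g_bot g_top =
  List.for_all
    (fun (v, v') ->
      let u, u' = (owner v, owner v') in
      u = u' || List.mem (u, u') g_top)
    g_bot

let () =
  Printf.printf "H1 with file edge: %b\n" (h1_holds g_bot g_top);
  Printf.printf "H1 without file edge: %b\n" (h1_holds g_bot [])
```

Dropping the file-level edge makes the check fail, because the dependency of b1 on a1 then has no counterpart between B.java and A.java.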

Overapproximation strategy: Let U'_i be the set of impacted and fresh vertices
in U', computed as above without considering vertices in V'. Consider the set
V'_i = ⋃_{u ∈ U'_i} p'(u), which contains fresh and potentially impacted vertices
in V'. Executing all checkable vertices in V'_i may perform needless work for
unimpacted vertices, but completely elides analysis of g⊥. This approach
essentially corresponds to relying on comparing whole files to decide whether to
rerun commands that operate on every component inside these files, as in make.

Compositional strategy: Let U_i be the set of impacted vertices in U, computed
as above. Consider the set V_p = ⋃_{u ∈ U_i} p(u) of potentially impacted
vertices in V. We use this set to scope further analysis. In particular, we use the
subgraph g_p of g⊥ induced by V_p to precisely find the impacted vertices in V.
While unimpacted vertices are then avoided, the additional analysis of g⊥ may
be time-consuming to perform compared to the first strategy. At a high level,
this strategy corresponds to the one used in RPS [11] and hybrid RTS [60].
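
The two strategies can be contrasted on a toy hierarchy. The sketch below is an illustration under simplifying assumptions: it only considers the old revision (so fresh vertices are left out), files and methods are integers, and the partition, graphs, and modified sets are invented for the example.

```ocaml
(* Hypothetical partition: file -> methods it contains. *)
let p = function 0 -> [ 0; 1 ] | 1 -> [ 2; 3 ] | 2 -> [ 4 ] | _ -> []

(* Inverse graphs as successor functions: x maps to the vertices depending on x. *)
let g_top_inv = function 0 -> [ 1 ] | _ -> []   (* file 1 depends on file 0 *)
let g_bot_inv = function 0 -> [ 2 ] | _ -> []   (* method 2 depends on method 0 *)

let modified_files = [ 0 ]
let modified_methods = [ 0 ]

(* Sorted reflexive-transitive closure of the start vertices. *)
let closure succs start =
  let rec visit seen v =
    if List.mem v seen then seen
    else List.fold_left visit (v :: seen) (succs v)
  in
  List.sort compare (List.fold_left visit [] start)

let impacted_files = closure g_top_inv modified_files

(* Overapproximation: every method inside an impacted file. *)
let over = List.concat_map p impacted_files

(* Compositional: restrict the method graph to those candidates, then analyze it. *)
let compositional =
  let in_scope v = List.mem v over in
  let succs v = List.filter in_scope (g_bot_inv v) in
  closure succs (List.filter in_scope modified_methods)

let () =
  Printf.printf "overapproximation selects %d methods, compositional selects %d\n"
    (List.length over) (List.length compositional)
```

Here the overapproximation selects methods 0 to 3, whereas the compositional strategy narrows the result to methods 0 and 2 at the cost of a second closure computation.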


5.2 Encoding and Correctness in Coq 

To encode hierarchical analysis, we use finite types and functions (now suffixed by
top and bot) in the same way as before, while adding partitioning assumptions:

Vars (p : U → {set V}) (p' : U' → {set V'}).
Hyp p_pt : partition (\bigcup_(u | u \in U) [set (p u)]) [set: V].
Hyp p'_pt : partition (\bigcup_(u | u \in U') [set (p' u)]) [set: V'].

For the overapproximation strategy, we first define impacted sets: 

Def if_top : {set U'} := impacted_fresh_V' f'_top f_top g_top.
Def p'_if_bot : {set V'} := \bigcup_(u | u \in if_top) (p' u).

Under the assumptions outlined above, we then show formally that p'_if_bot is
a superset of the results of analysis of V, V', and the graph g_bot:

Thm in_p' (v : V') : v \in impacted_fresh_V' f'_bot f_bot g_bot → v \in p'_if_bot.

The key fact we use to prove this theorem is the following:


Thm connect_top_bot v v' u u' : v \in (p u) → v' \in (p u') →
  connect g_bot v v' → connect g_top u u'.


To encode the compositional strategy, we first define impacted sets: 
Def i_top : {set U} := impacted g_top⁻¹ (mod_V f'_top f_top).
Def p_i_bot : {set V} := \bigcup_(u | u \in i_top) (p u).


Then, we define a subtype and accompanying graph:

Def P_V_sub : pred V := fun v => v \in p_i_bot.
Def V_sub : finType := sig_finType P_V_sub.
Def g_bot_sub : rel V_sub := [rel x y | g_bot (val x) (val y)].


This allows us to use our previously defined analysis functions compositionally:


Def mod_V_sub := [set v : V_sub | val v \in mod_V f'_bot f_bot].
Def impacted_V_sub := impacted g_bot_sub⁻¹ mod_V_sub.
Def impacted_V'_sub := [set val (val v) | v in impacted_V_sub].
Def impacted_fresh_V'_sub := impacted_V'_sub :|: fresh_V' P_bot.


We finally show that the last set is the same as the one we would have obtained 
by directly analysing the graph g_bot:

Thm impacted_fresh_V'_sub_eq :
  impacted_fresh_V'_sub = impacted_fresh_V' f'_bot f_bot g_bot.


Using these definitions and results, we proved soundness and completeness for 
both strategies using the same approach as in Section 4.2. 


6 Tool Implementation 


While our core definitions of change impact analysis described in Section 4 are 
executable inside Coq, this does not mean they are efficient or that code ex- 
tracted from the definitions is immediately usable. We describe two aspects of 
bringing verified Coq code into our tool CHIP: optimizations and encapsulation. 


6.1 Optimizations 


Our basic transitive closure function impacted is simple to reason about but not 
particularly fast in practice, since it fully explores the closures of all elements 
in the set of modified vertices. To mitigate this, we refined the function by 
leveraging the depth-first search function dfs from the fingraph MC module 
to incrementally compute the closure. dfs takes a graph as a function from 
vertices to neighbor sequences and a depth bound, and terminates as soon as it 
encounters a known vertex. We perform a stack-efficient left fold with dfs over 
an input sequence of vertices: 


Def clos (l : seq V) : seq V := foldl (dfs g #|V|) [::] l.


Note that we set the dfs depth bound to the number of elements in the finite 
type V (written #|V|) to fully explore the graph g. However, one limitation of the 
MC dfs function is its linear-time sequence membership lookups. We therefore
defined a better closure function with logarithmic membership lookup time using 
sets backed by red-black trees as found in the Coq standard library [2, 23]: 


Fixpoint sdfs (g : V → seq V) (n : nat) (s : RBT.t) (x : V) : RBT.t :=
  if RBT.mem x s then s else
  if n is n'.+1 then foldl (sdfs g n') (RBT.add x s) (g x) else s.
Def sclos (l : seq V) : seq V := RBT.elements (foldl (sdfs g #|V|) RBT.empty l).


We used this closure function to define a function seq_impacted_fresh which we
proved extensionally equivalent to impacted_fresh_V' defined in Section 4.1. We
also added many custom extraction directives in Coq to ensure the extracted 
code uses efficient OCaml library functions, e.g., for list operations [22]. 
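
The set-based closure can be approximated directly in OCaml as follows. This is a hand-written analog for illustration, not the code that extraction produces; it uses the standard library's Set.Make, which is backed by balanced binary trees (not red-black trees) but likewise gives logarithmic membership tests. The graph example_g is hypothetical.

```ocaml
module IntSet = Set.Make (Int)

(* Bounded depth-first search that stops at vertices already in s.
   Mirrors the structure of sdfs: n is the remaining depth bound. *)
let rec sdfs (g : int -> int list) (n : int) (s : IntSet.t) (x : int) : IntSet.t =
  if IntSet.mem x s then s
  else if n > 0 then List.fold_left (sdfs g (n - 1)) (IntSet.add x s) (g x)
  else s

(* Closure of all vertices in l; num_vertices plays the role of #|V|. *)
let sclos (g : int -> int list) (num_vertices : int) (l : int list) : int list =
  IntSet.elements (List.fold_left (sdfs g num_vertices) IntSet.empty l)

(* Hypothetical graph with 5 vertices: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3. *)
let example_g = function 0 -> [ 1; 2 ] | 1 -> [ 3 ] | 2 -> [ 3 ] | _ -> []

let () =
  List.iter (Printf.printf "%d ") (sclos example_g 5 [ 0 ]);
  print_newline ()
```

Because the accumulated set is threaded through the fold, vertex 3 is visited only once even though it is reachable via both 1 and 2.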


6.2 Encapsulation 


Before extraction to OCaml, we instantiate the finite types for graph vertices 
to ordinal finite types, which intuitively contain all natural numbers from 0 up 
to (but not including) some bound k. These numbers can then become machine 
integers during extraction, which allows us to provide a simple OCaml interface: 


val impacted_fresh : int -> int -> (int -> string) ->
  (int -> string) -> (int -> int list) -> int list

Here, the first argument is the number of vertices in the new graph, while the
second is the number of vertices in the old graph. After these integers follow two 
functions that map new and old vertices, respectively, to their artifacts in the 
form of OCaml strings. Then comes a function that defines the adjacent vertices 
of vertices in the old graph. The result is a list of impacted and fresh vertices. 

Not all computationally meaningful types in Coq can be directly represented 
in OCaml’s type system. Some function calls must therefore circumvent the type 
system by using calls to the special Obj.magic function [40]. We use this approach
in our implementation of the above interface: 


let impacted_fresh num_new num_old f' f succs =
  Obj.magic (ordinal_seq_impacted_fresh num_new num_old
    (Obj.magic (fun x -> char_list_of_string (f' x)))
    (Obj.magic (fun x -> char_list_of_string (f x)))
    (Obj.magic succs))


The interface and implementation for two-level compositional hierarchical se- 
lection is a straightforward extension, with an additional argument p of type 
int -> int list for between-level partitioning. 
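
A client of an integer-based interface like this one first has to number its components. The following sketch shows one way to do so; it does not call CHIP, and the helper names (index, int_succs), the file names, and the successor map are hypothetical. The resulting function has the type int -> int list expected for the graph argument.

```ocaml
(* Number the given component names 0 .. k-1 in list order. *)
let index (names : string list) : (string, int) Hashtbl.t =
  let tbl = Hashtbl.create 16 in
  List.iteri (fun i name -> Hashtbl.replace tbl name i) names;
  tbl

(* Adapt a name-based successor map to an integer-based one.
   Raises Not_found if succs mentions a name that was not numbered. *)
let int_succs (names : string list) (succs : string -> string list) : int -> int list =
  let tbl = index names in
  let arr = Array.of_list names in
  fun i -> List.map (Hashtbl.find tbl) (succs arr.(i))

(* Hypothetical components and adjacency. *)
let names = [ "a.v"; "b.v"; "c.v" ]
let succs = function "a.v" -> [ "b.v"; "c.v" ] | "b.v" -> [ "c.v" ] | _ -> []
let succs_int = int_succs names succs

let () =
  List.iter (Printf.printf "%d ") (succs_int 0);
  print_newline ()
```

Keeping the numbering in a single table also makes it easy to translate the returned list of integers back to component names.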


7 Evaluation of the Model 


To evaluate our model and its Coq encoding, we performed an empirical study 
by integrating CHIP with a recently developed RTS tool, EKSTAZI, one RPS 
tool, iCoq, and one build system, TUP. We then ran the modified RTS tool on
open-source Java projects used to evaluate RTS techniques [26,37], the modified 
RPS tool on Coq projects used in its evaluation [11], and the modified build 
system on C/C++ projects. Finally, we compared the outcomes and running 
times with those for the unmodified versions of EKSTAZI, iCoq, and TUP.


7.1 Tool Integration 


Integrating CHIP with EKSTAZI was challenging, since EKSTAZI collects depen- 
dencies dynamically and builds only a flat list of dependencies rather than an 
explicit graph. To overcome this limitation, we modified EKSTAZI to build an ex- 
plicit graph by maintaining a mapping from method callers to their callees. The 
integration with iCoq was also challenging because of the need for hierarchical
selection of proofs and support for deletion of dependency graph vertices. We
handle deletion of a vertex in iCoq by temporarily adding it to the new graph
with a different artifact (checksum) from before, marked as non-checkable; then,
after selection, we purge the vertex. In contrast, the integration with TUP was
straightforward, since TUP stores dependencies in an SQLite database. We simply
query this database to obtain a graph in the format expected by CHIP.
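
The deletion workaround described above can be sketched as follows. The record type, the "#deleted" suffix used to force a different artifact, and the function names are assumptions of this illustration; the actual integration may represent vertices differently.

```ocaml
(* Hypothetical vertex representation. *)
type vertex = { id : int; artifact : string; checkable : bool }

(* Re-add every old vertex missing from the new revision as a non-checkable
   placeholder whose artifact is guaranteed to differ from the old one.
   Returns the extended vertex list and the ids of the placeholders. *)
let add_deleted (old_vs : vertex list) (new_vs : vertex list) : vertex list * int list =
  let present v = List.exists (fun w -> w.id = v.id) new_vs in
  let placeholders =
    old_vs
    |> List.filter (fun v -> not (present v))
    |> List.map (fun v ->
           { v with artifact = v.artifact ^ "#deleted"; checkable = false })
  in
  (new_vs @ placeholders, List.map (fun v -> v.id) placeholders)

(* After selection, purge the placeholders from the selected vertex ids. *)
let purge (deleted_ids : int list) (selected : int list) : int list =
  List.filter (fun id -> not (List.mem id deleted_ids)) selected

let () =
  let old_vs =
    [ { id = 0; artifact = "a"; checkable = true };
      { id = 1; artifact = "b"; checkable = true } ]
  in
  let new_vs = [ { id = 0; artifact = "a"; checkable = true } ] in
  let graph_vs, deleted = add_deleted old_vs new_vs in
  Printf.printf "%d vertices analyzed, %d purged afterwards\n"
    (List.length graph_vs) (List.length deleted)
```

Because the placeholder counts as modified, vertices that depended on the deleted one are still reported as impacted, while the placeholder itself is never checked.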


7.2 Projects 


RTS: We use 10 GitHub projects. Table 1 (top) shows the name of each project, 
the number of lines of code (LOC) and the number of tests in the latest version 
control revision we used in our experiments, the SHA of the latest revision, and 
Table 1. List of Projects Used in the Evaluation (RTS at the Top, RPS in the Middle,
and TUP at the Bottom).


Project        LOC      #Tests   SHA       URL (github.com/)
Asterisk        57,219     257   e36c655f  asterisk-java/asterisk-java
Codec           22,009     887   58860269  apache/commons-codec
Collections     66,356  24,589   d83572be  apache/commons-collections
Lang            81,533   4,119   c3de2d69  apache/commons-lang
Math           186,388   4,858   eb57d6d4  apache/commons-math
GraphHopper     70,615   1,544   14d2d670  graphhopper/graphhopper
La4j            13,823     799   e77dca70  vkostyukov/la4j
Planner         82,633     398   f12e8600  opentripplanner/OpenTripPlanner
Retrofit        20,476     603   7a0251cc  square/retrofit
Truth           29,591   1,448   14f72173  google/truth
Total          630,643  39,502   N/A       N/A

Project        LOC      #Proofs  SHA       URL
Flocq           33,544     943   4161c990  gitlab.inria.fr/flocq/flocq
StructTact       2,497     187   8f1bc10a  github.com/uwplse/StructTact
UniMath         45,638     755   5e525f08  github.com/UniMath/UniMath
Verdi           57,181   2,784   15be6f61  github.com/uwplse/Verdi
Total          138,860   4,669   N/A       N/A

Project        LOC      #Cmds    SHA       URL (github.com/)
guardcheader       656       5   dbd1c0f   kalrish/guardcheader
LazyIterator     1,276      18   d5f0b64   matthiasvegh/LazyIterator
libhash            347      10   b22c27e   fourier/libhash
Redis          162,366     213   39c70e7   antirez/redis
Shaman             925       7   73c048d   HalosGhost/shaman
Tup            200,135      86   f77dbd4   gittup/tup
Total          365,705     339   N/A       N/A


URL on GitHub. We chose these projects because they are popular Java proj- 
ects (in terms of stars) on GitHub, use the Maven build system (supported by 
EKSTAZI), and were recently used in RTS research [37, 60]. 

RPS: We use 4 Coq projects. Table 1 (middle) shows the name of each project, 
the number of LOC and the number of proofs in the latest revision we used, 
the latest revision SHA, and URL. We chose these projects because they were 
used in the evaluation of iCoq [11]; as in that evaluation, we used 10 revisions
of StructTact and 24 revisions of the other projects. 

Build system: We use 6 GitHub projects. Table 1 (bottom) shows the name 
of each project, the number of LOC and the number of build commands in 
the latest revision we used, the latest revision SHA, and URL. We chose these 
projects from the limited set of projects on GitHub that use Tup. We looked 
for projects that could be built successfully and had at least five revisions; the 
largest project that met these requirements, in terms of LOC, was TUP itself.


7.3 Experimental Setup 


Our experimental setup closely follows recent work on RTS [37,60]. That is, 
our scripts (1) clone one of the projects; (2) integrate the (modified) EKSTAZI, 
iCoq, or TUP; and (3) execute tests on, check proofs for, or build the (up to) 24
latest revisions. For each run, we recorded the end-to-end execution time, which 
includes time for the entire build run. We also recorded the execution time for 
change impact analysis alone. Finally, we recorded the number of executed tests, 
Table 2. Execution Time and CIA Time in Seconds for EKSTAZI and CHIP.


               Total                               CIA
Project        RetestAll    Ekstazi      Chip      Ekstazi     Chip
Asterisk          461.92     188.67     194.65        2.74     6.51
Codec             896.00     135.11     136.35        2.44     4.10
Collections     2,754.99     342.07     350.95        2.87     9.31
Lang            1,844.19     359.36     367.16        2.71     8.68
Math            2,578.09   1,459.98   1,495.71        1.79     7.13
GraphHopper     1,871.01     423.63     449.94       11.19    21.33
La4j              272.96     202.10     209.41        1.12     3.91
Planner         4,811.63   1,144.09   1,228.61       40.62    89.17
Retrofit        1,181.09     722.14     747.76       11.30    19.97
Truth             745.11     700.26     724.22        3.03     8.82
Total          17,416.99   5,677.41   5,904.76       79.81   178.93


Table 3. Execution/CIA Time in Seconds for iCoq and CHIP.


               Total                               CIA
Project        RecheckAll   iCoq       Chip        iCoq      Chip
Flocq            1,028.01     313.08     318.19     50.65     53.43
StructTact          45.86      43.90      44.49     14.45     14.98
UniMath         14,989.09   1,910.56   2,026.75    124.79    239.12
Verdi           37,792.07   3,604.23   4,637.27    139.09  1,171.57
Total           53,855.03   5,871.76   7,026.70    328.98  1,479.10


proofs, or commands, which we use to verify the correctness of the results, i.e., 
we checked that the results for the unmodified tool and CHIP were equivalent. 
We ran all experiments on a 4-core Intel i7-6700 CPU @ 3.40GHz machine with
16GB of RAM, running Ubuntu Linux 17.04. We confirmed that the execution 
time for each experiment was similar across several runs. 


7.4 Results 


RTS: Table 2 shows the execution times for EKSTAZI. Column 1 shows the
names of the projects. Columns 2 to 4 show the cumulative end-to-end time for
RetestAll (i.e., running all tests at each revision), the unmodified RTS tool, and
the RTS tool with CHIP. Columns 5 and 6 show the cumulative time for change
impact analysis (CIA time). The last row in the table shows the cumulative
execution time across all projects. We have several findings. First, EKSTAZI with
CHIP performs significantly better than RetestAll (5,904.76 s versus 17,416.99 s
in total, about 66% less time), and only slightly worse than the unmodified tool
(5,677.41 s in total, a difference of about 4%). Considering that we did not
prioritize optimizing the integration, we believe that the current execution time
differences are small. Second, the CIA time using CHIP is higher than the CIA
time for the unmodified tool (178.93 s versus 79.81 s in total), but we believe
this could be addressed by integrating CHIP via the Java Native Interface (JNI).
The selected tests for all projects and revisions were the same for the unmodified
EKSTAZI and EKSTAZI with CHIP.

RPS: Table 3 shows the total proof checking time for iCoq and the CIA time
for iCoq and CHIP. All time values are cumulative time across all the revisions
we used. We find that iCoq with CHIP has only marginal differences in
performance from iCoq for all but the largest project, Verdi. While iCoq with
CHIP is notably slower in that case (4,637.27 s versus 3,604.23 s), it still saves
a significant fraction of time from checking every revision from scratch
(RecheckAll, 37,792.07 s). StructTact is an outlier
in that RecheckAll is actually faster than both iCoq and iCoq with CHIP, due
to the overhead from bookkeeping and graph processing in comparison to the
project's relatively small size. The selected proofs for all projects and revisions
were the same for the unmodified iCoq and iCoq with CHIP.

Build system: Table 4 shows the total execution time for TUP and the CIA time
for TUP and CHIP. All time values are cumulative time across all the revisions
we used. Unfortunately, the build time for most of the projects is short.
However, we can still observe that CHIP takes only slightly more time than the
original tool to perform change impact analysis. In the future, we plan to
evaluate our toolchain on larger projects. The lists of commands for all projects
and all revisions were the same for the unmodified TUP and TUP with CHIP.

Table 4. Execution Time in Milliseconds for TUP and CHIP.

               Total      CIA
Project        Tup        Tup      Chip
guardcheader    20,358    1,788    1,785
LazyIterator    61,476      869    1,007
libhash         15,279      433      446
Redis           68,076    1,919    4,779
Shaman           8,702      609      614
Tup             87,547    1,949    4,168
Total          261,438    7,567   12,799

Overall, we believe these results indicate that our formal model is practically
relevant and that it is feasible to use CHIP as a verified component for change
impact analysis in real-world tools.


8 Related Work 


Formalizations of graph algorithms: Pottier [49] encoded and verified Kosa- 
raju's algorithm for computing strongly connected graph components in Coq. He 
also derived a practical program for depth-first search by extracting Coq code 
to OCaml, demonstrating the feasibility of extraction for graph-based programs. 
Théry subsequently formalized a similar encoding of Kosaraju's algorithm in 
Coq using the MC fingraph module [56]. Théry and Cohen then formalized 
and proved correct Tarjan's algorithm for computing strongly connected graph 
components in Coq [13,16]. Our formalization takes inspiration from Théry and 
Cohen's work, and adapts some of their definitions and results in a more applied 
context, with focus on performance of extracted code. Similar graph algorithm 
formalizations have also been done in the Isabelle/HOL proof assistant [35]. In 
work particularly relevant to build systems, Guéneau et al. [32] verified both 
the functional correctness and time complexity of an incremental graph cycle 
detection algorithm in Coq. In contrast to our reasoning on pure functions and 
use of extraction, they reason directly on imperative OCaml code. 

Formalizations of build systems: Christakis et al. [15] formalized a general 
build language called CloudMake in the Dafny verification tool. Their language 
is a purely functional subset of JavaScript, and allows describing dependencies 
between executions of tools and files. Having embedded their language in Dafny, 
they verify that builds with cached files are equivalent to builds from scratch. 
In contrast to the focus on generating files in CloudMake, we consider a formal 
model with an explicit dependency graph and an operation check on vertices 
whose output is not used as input to other operations. The CloudMake for- 
malization assumes an arbitrary operation exec that can be instantiated using 
Dafny's module refinement system; we use Coq section variables to achieve similar
parametrization for check. We view our Coq development as a library useful
to tool builders, rather than a separate language that imposes a specific idiom 
for expressing dependencies and build operations. 

Mokhov et al. [45] presented an analysis of several build systems, including 
a definition of what it means for such systems to be correct. Their correctness
formulation is similar to that of Christakis et al. for cached builds, and relies 
on a notion of abstract persistent stores expressed via monads. Our vertices and 
artifacts correspond quite closely to their notions of keys and values, respectively. 
However, their basic concepts are given as Haskell code, which has less clear 
meaning and a larger trusted base than Coq or Dafny code. Moreover, they 
provide no formal proofs. Mokhov et al. [44] subsequently formalized in Haskell 
a static analysis of build dependencies as used in the Dune build system. 

Stores could be added to our model, e.g., by letting checkable vertices be 
associated with commands that take lists of file names and the current store 
state as parameters, producing a new state. However, this would in effect entail 
defining a specific build language inside Coq, which we consider outside the scope 
of our library and tool. 


9 Conclusion 


We presented a formalization of change impact analysis and its encoding and 
correctness proofs in the Coq proof assistant. Our formal model uses finite sets 
and graphs to capture system components and their interdependencies before 
and after a change to a system. We locate impacted vertices that represent, 
e.g., tests to be run or build commands to be executed, by computing transitive 
closures in the pre-change dependency graph. We also considered two strategies 
for change impact analysis of hierarchical systems of components. We extracted 
optimized impact analysis functions in Coq to executable OCaml code, yielding 
a verified tool dubbed CHIP. We then integrated CHIP with a regression test
selection tool for Java, EKSTAZI, one regression proof selection tool for Coq
itself, ICOQ, and one build system, TUP, by replacing their existing components
for impact analysis. We evaluated the resulting toolchains on several open-source 
projects by comparing the outcome and running time to those for the respective 
original tools. Our results show the same outcomes with only small differences 
in running time, corroborating the adequacy of our model and the feasibility of 
practical verified tools for impact analysis. We also believe our Coq library can 
be used as a basis for proving correct domain-specific incremental techniques 
that rely on change impact analysis, e.g., regression test selection for Java and 
regression proof selection for type theories. 


Abstract. We consider the decidability of the verification problem of
programs modulo axioms — automatically verifying whether programs 
satisfy their assertions, when the function and relation symbols are inter- 
preted as arbitrary functions and relations that satisfy a set of first-order 
axioms. Though verification of uninterpreted programs (with no axioms) 
is already undecidable, a recent work introduced a subclass of coherent 
uninterpreted programs, and showed that they admit decidable verifica- 
tion [26]. We undertake a systematic study of various natural axioms for 
relations and functions, and study the decidability of the coherent ver- 
ification problem. Axioms include relations being reflexive, symmetric, 
transitive, or total order relations, functions restricted to being associa- 
tive, idempotent or commutative, and combinations of such axioms as 
well. Our comprehensive results unearth a rich landscape that shows that 
though several axiom classes admit decidability for coherent programs, 
coherence is not a panacea as several others continue to be undecidable. 


1 Introduction 


Programs are proved correct against safety specifications typically by induction— 
the induction hypothesis is specified using inductive invariants of the program, 
and one proves that the reachable states of the program stay within the region
defined by the invariants, inductively. Though there has been tremendous
progress in the field of decidable logics for proving that invariants are inductive, 
finding inductive invariants is almost never fully automatic. And completely au- 
tomated verification of programs is almost always undecidable. 

Programs can be viewed as working over a data-domain, with variables stor- 
ing values over this domain and being updated using constants, functions and 
relations defined over that domain. Apart from the notable exception of finite 
data domains, program verification is typically undecidable when the data domain
is infinite. In a recent paper, Mathur et al. [26] establish new decidability
results when the data domain is infinite. Two crucial restrictions are imposed — 
data domain functions and relations are assumed to be uninterpreted and pro- 
grams are assumed to be coherent (the meaning of coherence is discussed later 
in this introduction). The theory of uninterpreted functions is an important theory
in SMT solvers that is often used (in conjunction with other theories) to
solve feasibility of loop-free program snippets, in bounded model-checking, and 
to validate verification conditions. The salient aspect of [26] is to show that entire 
program verification is decidable for the class of coherent programs, without any 
user-provided inductive invariants (like loop invariants). While the results of [26] 
were mainly theoretical, there has been recent work on applying this theory to 
verifying memory-safety of heap-manipulating programs [28]. 


Data domain functions and relations used in a program usually satisfy special 
properties and are not, of course, entirely uninterpreted. The results of [26] can 
be seen as an approximate/abstraction-based verification method in practice — 
if the program verifies assuming functions and relations to be uninterpreted, 
then the program is correct for any data domain. However, properties of the 
data domain are often critical in establishing correctness. For example, in order 
to prove that a sorting program results in sorted arrays, it is important that the 
binary relation < used to compare elements of the array is a total ordering on
the underlying data sort. Consequently, constraining the data domain to satisfy 
certain axioms results in more accurate modeling for verification. 


In this paper, we undertake a systematic study of the verification of uninterpreted
programs when the data-domains are constrained using theories specified
by (universally quantified) axioms. The choice of the axioms we study is
guided by two principles. First, we study natural mathematical properties of
functions and relations. Second, we choose to study axioms that have a decidable
quantifier-free fragment of first order logic. The reason is that even single
program executions (as defined in Section 3.2) can easily encode quantifier-free
formulae (by computing the terms in variables, and asserting Boolean combinations
of atomic relations and equality on them). Since we are seeking decidable verification
for programs with loops/iteration, it makes little sense to examine axioms
where even verification of single executions is undecidable.


Coherence modulo theories: Mathur et al. [26] define a subclass of programs,
called coherent programs, for which program verification on uninterpreted
domains is decidable; without the restriction of coherence, program verification 
on uninterpreted domains is undecidable. Since our framework is strictly more 
powerful, we adapt the notion of coherence to incorporate theories. A coherent 
program [26] is one where all executions satisfy two properties — memoizing and 
early-assumes. The memoizing property demands that the program computes 
any term, modulo congruence induced by the equality assumes in the execution, 
only once. More precisely, if an execution recomputes a term, the term should be 
stored in a current variable. The early-assumes restriction demands, intuitively, 
that whenever the program assumes two terms to be equal, it should do so early, 
before computing superterms of them. 
We adapt the above notion to coherence modulo theories¹. The memoizing
and early-assumes properties are now required modulo the equalities that are
entailed by the axioms. More precisely, if the theory is characterized by a set of
axioms A, the memoizing property demands that if a program computes a term
t and there was another term t' that it had computed earlier which is equivalent
to t modulo the assumptions made thus far and the axioms A, then t' must be
currently stored in a variable. Similarly, the early-assumes condition is also with
respect to the axioms: if the program execution observes a new assumption of
equality or a relation holding between terms, then we require that any equality
newly entailed by it, the previous assumptions, and the axioms A does not involve
a dropped term. This is a smooth extension of the notion of coherence from [26];
when A = Ø, we essentially retrieve the notion from [26].


Main Contributions 


Our first contribution is an extension of the notion of coherence in [26] to handle 
the presence of axioms, as described above; this is technically nontrivial and we 
provide a natural extension. 

Under the new notion of coherence, we first study axioms on relations. The 
EPR (effectively propositional reasoning) [37] fragment of first order logic is 
one of the few fragments of first order logic that is decidable, and has been 
exploited for bounded model-checking and verification condition validation in the 
literature [34,33,32]. We study axioms written in EPR (i.e., universally quantified 
formulas involving only relations) and show that verification for even coherent 
programs, modulo EPR axioms, is undecidable. 

Given the negative result on EPR, we look at particular natural axioms for 
relations, which are nevertheless expressible in EPR. In particular, we look at 
reflexivity, irreflexivity, and symmetry axioms, and show that verification of co- 
herent programs is decidable when the interpretation of some relational symbols 
is constrained to satisfy these axioms. Our proof proceeds by instrumenting the 
program with auxiliary assume statements that preserve coherence and subtle 
arguments that show that verification can be reduced to the case without axioms; 
decidability then follows from results established in [26]. 

We then show a much more nontrivial result that verification of coherent 
programs remains decidable when some relational symbols are constrained to 
be transitive. The proof relies on new automata constructions that compute 
streaming congruence closures while interpreting the relations to be transitive. 

Furthermore, we show that combinations of reflexivity, irreflexivity, symmetry,
and transitivity admit a decidable verification problem for coherent programs.
Using this observation, we conclude decidability of verification when certain
relations are required to be strict partial orders (irreflexive and transitive)
or equivalence relations.


¹ We adapt the definition in a way that preserves the spirit of the definition of coherence.
Moreover, if we do not adapt the definition, essentially all axiom classes we
study in this paper would be undecidable.
We then consider axioms that capture total orders and show that they too
admit a decidable coherent verification problem. Total orders are also expressible 
in EPR and their formulation in EPR has been used in program verification, 
as they can be used in lieu of the ordering on integers when only ordering is 
important. For example, they can be used to model data in sorting algorithms,
array indices, and, when modeling distributed systems, process ids and the states
of processes [34,33].

Our next set of results considers axioms on functions. Associativity and commutativity
are natural and fundamental properties of functions (like + and *),
and hence these axioms are natural ways to capture/abstract such functions. (See [14]
where such abstractions are used in program analysis.) We first show that verification
of coherent programs is decidable when some functions are assumed to be
commutative or idempotent. Our proof, similar to the case of reflexive and symmetric
relations, relies on reducing verification to the case without axioms using
program instrumentation that captures the commutativity and idempotence axioms.
However, when a function is required to be associative, the verification
problem for coherent programs becomes undecidable. This undecidability result
was surprising to us.

The decidability results established for properties of individual relation or 
function symbols discussed above can be combined to yield decidable verification
modulo a set of axioms. That is, the verification of coherent programs with
respect to models where relational symbols satisfy some subset of the
reflexivity/irreflexivity/symmetry/transitivity axioms or none, and function symbols are
either uninterpreted, commutative, or idempotent, is decidable.

The decidability results outlined above apply to programs that are coherent modulo
the axioms/theories. However, given a program, in order to verify it using our
techniques, we would also like to decide whether the program is coherent modulo
axioms. We prove that for all the decidable axioms above, checking whether
programs are coherent modulo the axioms is a decidable problem. Consequently, 
under these axioms, we can both check whether programs are coherent modulo 
the axioms and if they are, verify them. 

There are several other results that we mention only in passing. For instance, 
we show that even for single executions, verifying them modulo equational ax- 
ioms is undecidable as it is closely related to the word problem for groups. And 
our positive results for program verification under axioms for functions (commutativity,
idempotence) also show that bounded model-checking under such
axioms is decidable, which can have its own applications. 

Due to the large number of results and technically involved proofs, we give 
only the main theorems and proof gists for some of these in the paper; details 
can be found in [27]. 


2 Illustrative Example 


Consider the problem of searching for an element k in a sorted list. There are 
two simple algorithms for this problem. Algorithm 1 walks through the list from 
    assume (T ≠ F);
    found := F;
    stop := F;
    exists := F;
    sorted := T;
    while (x ≠ NIL) {
        if (stop = F) then {
            if (k = key(x)) then found := T;
            if (k ≤ key(x)) then stop := T;
        }
        if (k = key(x)) then exists := T;
        y := next(x);
        if (y ≠ NIL) then {
            if (key(x) ≰ key(y)) then sorted := F;
        }
        x := y;
    }
    @post: sorted = T ⇒ found = exists

    [Right panel of Fig. 1: diagram of a data model (elements drawn as circles,
    with next and key edges and variable labels), in which
    < is {(e5, e6), (e6, e7), (e7, e5)}.]


Fig. 1. Left: Uninterpreted program for finding a key k in a list starting at x with <
interpreted as a strict total order. The condition a ≤ b is shorthand for a < b ∨ a = b.
Right: A model in which < is not interpreted as a strict total order. The elements
in the universe of the model are denoted using circles. Some elements are labeled
with variables denoting the initial values of these variables. The edges represent
the subterm relation. Not all functions are shown in the figure. The model does not satisfy
the post-condition of the program on the left.


beginning to end, and if it finds k, it sets a Boolean variable exists to T. Notice 
this algorithm does not exploit the sortedness property of the list. Algorithm 2 
also walks through the list, but it stops as soon as it either finds k or reaches an 
element that is larger than k. If it finds the element it sets a Boolean variable 
found to T. If both algorithms are run on the same sorted list, then their answers 
(namely, exists and found) must be the same. 

Fig. 1 (on the left) shows a program that weaves the above two algorithms 
together (treating Algorithm 1 as the specification for Algorithm 2). The variable 
x walks down the list using the next pointer. The variable stop is set to T when 
Algorithm 2 stops searching in the list. The precondition, namely that the input 
list is sorted, is captured by tracking another variable sorted whose value is T 
if consecutive elements are ordered as the list is traversed. The post condition 
demands that whenever the list is sorted, found and exists be equal when the 
list has been fully traversed. Note that the program's correctness is specified 
using only quantifier-free assertions using the same vocabulary as the program. 

The program works on a data domain that provides interpretations for the
functions key, next, the initial values of the variables, and the relation <. When
< is interpreted to be a strict total order, the program is correct. However,
if < is not interpreted as a total order, then the program may be incorrectly
deemed buggy. To see this, consider the data model shown on the right in
Fig. 1. The data domain has 9 elements in its universe, with the functions next
and key interpreted as shown. Initially, x, y have value e1, NIL is e4, k is e7, T
and sorted are e8, and F, found, exists, and stop are e9. The interpretation
of < is as follows: e5 < e6, e6 < e7, and e7 < e5. Clearly < is not an order,
but the program’s sortedness check “sorted = T” will pass. After the entire
list is processed, exists will be set to T when x = e3. On the other hand,
stop will be set to T when x = e1 because k = e7 < key(x). Therefore, at the
end found = F ≠ exists. The work presented in [26], where all functions and
relations are uninterpreted, would therefore declare this program to be incorrect.
The goal of this paper is to explore several natural restrictions on data models
and study the problem of verifying coherent programs for them. When < is
constrained to be a total order, the program in Fig. 1 is correct and coherent. Our
results (see Section 5.5) show that verification of such programs when relations 
are constrained to be strict total orders is decidable, and hence we can build 
automatic decision procedures that will correctly verify such programs. 
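
The counterexample of Fig. 1 can be replayed mechanically. The following Python sketch is only an illustration: the dictionary encoding of the model and the name `run_search` are assumptions made here, and the two order tests are read as k ≤ key(x) (stop) and key(x) ≰ key(y) (sortedness violation), with a ≤ b meaning a < b or a = b.

```python
# Illustrative encoding of the data model on the right of Fig. 1.
# Element names e1..e9 follow the text; the encoding itself is an assumption.
NEXT = {"e1": "e2", "e2": "e3", "e3": "e4"}       # list e1 -> e2 -> e3 -> NIL
KEY = {"e1": "e5", "e2": "e6", "e3": "e7"}        # keys stored in the list cells
LT = {("e5", "e6"), ("e6", "e7"), ("e7", "e5")}   # interpretation of <
NIL, T, F = "e4", "e8", "e9"


def leq(a, b, lt):
    """a <= b is shorthand for a < b or a = b."""
    return (a, b) in lt or a == b


def run_search(x, k, lt):
    """Run the program of Fig. 1 from initial values x and k under relation lt."""
    found = stop = exists = F
    is_sorted = T
    while x != NIL:
        if stop == F:
            if k == KEY[x]:
                found = T
            if leq(k, KEY[x], lt):
                stop = T
        if k == KEY[x]:
            exists = T
        y = NEXT[x]
        if y != NIL:
            if not leq(KEY[x], KEY[y], lt):
                is_sorted = F
        x = y
    return is_sorted, found, exists


# Cyclic "order" from the text: sorted = T, found = F, exists = T.
print(run_search("e1", "e7", LT))      # ('e8', 'e9', 'e8')

# A genuine strict total order e5 < e6 < e7 makes found and exists agree.
TOTAL = {("e5", "e6"), ("e6", "e7"), ("e5", "e7")}
print(run_search("e1", "e7", TOTAL))   # ('e8', 'e8', 'e8')
```

The first call violates the post-condition only because the interpretation of < is cyclic; under the total order the two searches agree, as the text argues.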


3 Preliminaries 


We briefly recall the syntax and semantics of uninterpreted programs and the 
verification problem modulo axioms. Our presentation closely follows [26] and 
for lack of space, some details have been postponed to [27]. 


3.1 Program Syntax 


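We consider imperative programs with loops over a fixed finite set of variables
$V$ and use constant ($\mathcal{C}$), function ($\mathcal{F}$), and predicate ($\mathcal{R}$) symbols belonging to
some first order signature $\Sigma = (\mathcal{C}, \mathcal{F}, \mathcal{R})$. Programs are then given by the syntax
below ($f \in \mathcal{F}$, $R \in \mathcal{R}$, $x, y \in V$, and $z$ is a tuple of variables in $V$):

$$\begin{array}{rcl}
\langle stmt \rangle & ::= & x := y \mid x := f(z) \mid \mathbf{assume}(\langle cond \rangle) \mid \mathbf{skip} \mid \langle stmt \rangle ; \langle stmt \rangle \\
 & \mid & \mathbf{while}\ (\langle cond \rangle)\ \langle stmt \rangle \mid \mathbf{if}\ (\langle cond \rangle)\ \mathbf{then}\ \langle stmt \rangle\ \mathbf{else}\ \langle stmt \rangle \\
\langle cond \rangle & ::= & x = y \mid R(z) \mid \neg \langle cond \rangle
\end{array}$$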


3.2 Executions and Semantics of Uninterpreted Programs 


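Executions of programs over $\langle stmt \rangle$ are words over the following alphabet

$$\Pi = \{\, \text{``}x := y\text{''},\ \text{``}x := f(z)\text{''},\ \text{``}\mathbf{assume}(x = y)\text{''},\ \text{``}\mathbf{assume}(x \neq y)\text{''},\ \text{``}\mathbf{assume}(R(z))\text{''},\ \text{``}\mathbf{assume}(\neg R(z))\text{''} \mid x, y \in V,\ z \text{ is a tuple of variables in } V \,\}$$

For a program $s \in \langle stmt \rangle$, the set of executions of $s$, denoted $\mathrm{Exec}(s)$, is a regular
language over the alphabet $\Pi$ and is given as follows (similar to [26]).

$$\begin{aligned}
\mathrm{Exec}(\mathbf{skip}) &= \epsilon \\
\mathrm{Exec}(x := y) &= \text{``}x := y\text{''} \\
\mathrm{Exec}(x := f(z)) &= \text{``}x := f(z)\text{''} \\
\mathrm{Exec}(\mathbf{assume}(c)) &= \text{``}\mathbf{assume}(c)\text{''} \\
\mathrm{Exec}(\mathbf{if}\ c\ \mathbf{then}\ s_1\ \mathbf{else}\ s_2) &= \text{``}\mathbf{assume}(c)\text{''} \cdot \mathrm{Exec}(s_1) + \text{``}\mathbf{assume}(\neg c)\text{''} \cdot \mathrm{Exec}(s_2) \\
\mathrm{Exec}(s_1 ; s_2) &= \mathrm{Exec}(s_1) \cdot \mathrm{Exec}(s_2) \\
\mathrm{Exec}(\mathbf{while}\ c\ \{s\}) &= [\text{``}\mathbf{assume}(c)\text{''} \cdot \mathrm{Exec}(s)]^* \cdot \text{``}\mathbf{assume}(\neg c)\text{''}
\end{aligned}$$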
The set of partial executions of s is the set of prefixes of words in Exec(s) and
is also regular.
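
To make the definition of Exec concrete, the sketch below enumerates the executions of a small program with every loop unrolled at most a fixed number of times (the full language is infinite as soon as a loop is present). The tuple-based program encoding and the names `execs` and `neg` are assumptions of this sketch, not notation from the paper.

```python
from itertools import product


def neg(c):
    """Negate a condition written as a string; a double negation cancels."""
    if c.startswith("¬(") and c.endswith(")"):
        return c[2:-1]
    return "¬(" + c + ")"


def execs(s, unroll):
    """Executions of statement s, with each loop iterated at most `unroll` times."""
    kind = s[0]
    if kind == "skip":
        return [()]                                # the empty word
    if kind == "assign":                           # ("assign", "x := f(z)")
        return [(s[1],)]
    if kind == "assume":                           # ("assume", c)
        return [("assume(" + s[1] + ")",)]
    if kind == "seq":                              # ("seq", s1, s2): concatenation
        return [a + b for a in execs(s[1], unroll) for b in execs(s[2], unroll)]
    if kind == "if":                               # ("if", c, s1, s2): union of branches
        yes = [("assume(" + s[1] + ")",) + e for e in execs(s[2], unroll)]
        no = [("assume(" + neg(s[1]) + ")",) + e for e in execs(s[3], unroll)]
        return yes + no
    if kind == "while":                            # ("while", c, body): bounded Kleene star
        body = [("assume(" + s[1] + ")",) + e for e in execs(s[2], unroll)]
        exit_word = ("assume(" + neg(s[1]) + ")",)
        words = []
        for n in range(unroll + 1):
            for choice in product(body, repeat=n):
                words.append(sum(choice, ()) + exit_word)
        return words
    raise ValueError("unknown statement kind: " + kind)


loop = ("while", "¬(x = NIL)", ("assign", "x := next(x)"))
for word in execs(loop, 2):
    print(" · ".join(word))
```

Each printed word consists of some number of rounds of "assume(c)" followed by the loop body, closed by the exit assumption "assume(¬c)", mirroring the last clause of the definition.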

A data model $M = (U_M, [\![\cdot]\!]_M)$ for signature $\Sigma$ is a first order structure
where $U_M$ is a universe of elements and $[\![\cdot]\!]_M$ maps every symbol in $\Sigma$ to its
interpretation. Given a data model $M$ over $\Sigma$ and an execution $\rho \in \Pi^*$, the
semantics of $\rho$ on $M$ is given by $\mathrm{eval}_M : \Pi^* \times V \to U_M$, which gives the
valuation of variables in $V$ at the end of an execution; the precise definition is
standard and is deferred to [27].


3.3 Feasibility of Executions Modulo Axioms 


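An execution is said to be feasible in a data model if every assumption made in
the execution holds on the model. More precisely, an execution $\rho$ is feasible in
$M$ if for every prefix $\sigma' = \sigma \cdot \text{``}\mathbf{assume}(c)\text{''}$ of $\rho$, we have

(a) $\mathrm{eval}_M(\sigma, x) = \mathrm{eval}_M(\sigma, y)$ if $c$ is $(x = y)$,

(b) $\mathrm{eval}_M(\sigma, x) \neq \mathrm{eval}_M(\sigma, y)$ if $c$ is $(x \neq y)$,

(c) $(\mathrm{eval}_M(\sigma, z_1), \ldots, \mathrm{eval}_M(\sigma, z_r)) \in [\![R]\!]_M$ if $c$ is $R(z_1, \ldots, z_r)$, and

(d) $(\mathrm{eval}_M(\sigma, z_1), \ldots, \mathrm{eval}_M(\sigma, z_r)) \notin [\![R]\!]_M$ if $c$ is $\neg R(z_1, \ldots, z_r)$.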

Let $A$ be a set of first order sentences, possibly including ground atomic
predicates². We say that a data model $M$ is an $A$-model, denoted $M \models A$, if
for every $\varphi \in A$, we have $M \models \varphi$. A formula $\varphi$ is $A$-valid, denoted $A \models \varphi$, if $\varphi$
holds in every model $M$ that satisfies $A$. An execution $\rho$ is said to be feasible
modulo $A$ if there is an $A$-model $M$ such that $\rho$ is feasible in $M$.
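
For a fixed finite data model, feasibility of a single execution can be checked by direct evaluation. The sketch below is a minimal illustration under assumed encodings: statements are tuples, function interpretations are dictionaries keyed by argument tuples, and relation interpretations are sets of tuples. The name `feasible` is hypothetical. Deciding feasibility modulo a set of axioms, which quantifies over all models of the axioms, is a different and harder problem that this sketch does not address.

```python
def feasible(execution, init, funcs, rels):
    """Check that every assume of `execution` holds in the given finite model.

    init  : initial value of each variable
    funcs : funcs[f][(a1, ..., ar)] is the interpretation of f on (a1, ..., ar)
    rels  : rels[R] is the set of tuples on which R holds
    """
    val = dict(init)
    for stmt in execution:
        op = stmt[0]
        if op == "copy":                      # ("copy", x, y) encodes x := y
            val[stmt[1]] = val[stmt[2]]
        elif op == "call":                    # ("call", x, f, (z1, ..., zr)) encodes x := f(z1, ..., zr)
            _, x, f, args = stmt
            val[x] = funcs[f][tuple(val[z] for z in args)]
        elif op == "eq":                      # ("eq", x, y, positive) encodes assume(x = y) or assume(x != y)
            _, x, y, positive = stmt
            if (val[x] == val[y]) != positive:
                return False
        elif op == "rel":                     # ("rel", R, (z1, ..., zr), positive)
            _, r, args, positive = stmt
            if (tuple(val[z] for z in args) in rels[r]) != positive:
                return False
        else:
            raise ValueError("unknown statement: " + repr(stmt))
    return True


# Two-element model: f maps everything to 1, and R holds only on (0, 1).
funcs = {"f": {(0,): 1, (1,): 1}}
rels = {"R": {(0, 1)}}
rho = [("call", "y", "f", ("x",)), ("rel", "R", ("x", "y"), True), ("eq", "x", "y", False)]
print(feasible(rho, {"x": 0, "y": 0}, funcs, rels))   # True
print(feasible(rho, {"x": 1, "y": 0}, funcs, rels))   # False
```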


3.4 Program Verification Modulo Axioms 


We consider programs annotated with post-conditions over the
syntax below. Here, $x$, $y$ and $z$ belong to the set of program variables $V$ and
$R \in \mathcal{R}$ is a relation symbol in $\Sigma$.


$$\mathcal{L} : \quad \varphi ::= x = y \mid R(z) \mid \varphi \vee \varphi \mid \neg \varphi$$


Definition 1 (Program Verification Modulo Axioms). For a program $s$
and a set of axioms $A$, we say that $s$ satisfies a postcondition $\varphi$ over the syntax
$\mathcal{L}$ modulo $A$ if for every $A$-model $M$ and for every execution $\rho \in \mathrm{Exec}(s)$ that is
feasible in $M$, $M$ satisfies $\varphi[\mathrm{eval}_M(\rho, V)/V]$ (i.e., $\varphi$ where each variable $x \in V$ is
replaced by $\mathrm{eval}_M(\rho, x)$).


We remark that one can alternatively phrase the verification problem stated
above in terms of feasibility. That is, a program $s$ satisfies a postcondition $\varphi$
modulo $A$ iff every execution $\rho$ of $s'$ is infeasible modulo $A$ (i.e., there is no
$A$-model $M$ such that $\rho$ is feasible in $M$), where $s' = s;\ \mathbf{assume}(\neg\varphi)$.


² A ground atomic predicate is of the form $t_1 \sim t_2$, or $R(t_1, \ldots, t_k)$ or $\neg R(t_1, \ldots, t_k)$,
where $\sim\, \in \{=, \neq\}$, $R$ is a relation symbol, and the $t_i$ are ground terms.


4 Coherence Modulo Axioms


In this section we extend the notion of coherence from [26], adapting it to our 
current setting where we restrict data models using axioms A. We will first recall 
the notion of terms computed by an execution, which will be used to define the 
notion of coherence. 


4.1 Terms Computed and Assumptions Accumulated by Executions 


We will associate a syntactic term $\mathrm{TEval}(\rho, x)$ with each variable $x \in V$ after a
partial execution $\rho$. Intuitively, every variable $x \in V$ stores a constant term $\hat{x}$ at
the beginning of an execution. New terms are computed on function computations,
i.e., $\mathrm{TEval}(\rho \cdot \text{``}x := f(z_1, \ldots, z_r)\text{''}, x) = f(\mathrm{TEval}(\rho, z_1), \ldots, \mathrm{TEval}(\rho, z_r))$.
The precise definition is simple and is deferred to [27]. The set of terms computed
by an execution $\rho$ is $\mathrm{Terms}(\rho) = \{\mathrm{TEval}(\rho', x) \mid \rho' \text{ is a prefix of } \rho,\ x \in V\}$.

As an execution proceeds, it accumulates assumptions over the terms it computes,
and we will use $\kappa(\rho)$ to denote the assumptions made by the execution $\rho$
(see [27] for the precise definition). For example, after an equality assume statement
“assume(x = y)”, we accumulate the atomic equality predicate $t_x = t_y$,
where $t_x$ and $t_y$ are the terms associated with $x$ and $y$ when the assume statement is
encountered. Similarly, for the execution $\rho = \rho' \cdot \text{``}\mathbf{assume}(\neg R(z_1, z_2, \ldots, z_k))\text{''}$,
we have $\kappa(\rho) = \kappa(\rho') \cup \{\neg R(\mathrm{TEval}(\rho', z_1), \ldots, \mathrm{TEval}(\rho', z_k))\}$.
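
The bookkeeping described in this subsection can be sketched as a single symbolic pass over an execution. In the illustrative Python below, terms are nested tuples, the initial constant of a variable x is written "x^", and the statement encoding and the name `symbolic_run` are assumptions of this sketch. The returned list plays the role of the accumulated assumptions, and the returned set the role of the computed terms.

```python
def symbolic_run(execution, variables):
    """Return (env, terms, assumptions) after a symbolic run of `execution`.

    env         : the term currently stored in each variable (the role of TEval)
    terms       : every term computed along the way (the role of Terms)
    assumptions : atomic predicates accumulated from assume statements
    """
    env = {x: x + "^" for x in variables}     # "x^" stands for the initial constant of x
    terms = set(env.values())
    assumptions = []
    for stmt in execution:
        op = stmt[0]
        if op == "copy":                      # ("copy", x, y) encodes x := y
            env[stmt[1]] = env[stmt[2]]
        elif op == "call":                    # ("call", x, f, (z1, ..., zr)) encodes x := f(z1, ..., zr)
            _, x, f, args = stmt
            env[x] = (f,) + tuple(env[z] for z in args)
            terms.add(env[x])
        elif op == "eq":                      # ("eq", x, y, positive)
            _, x, y, positive = stmt
            assumptions.append(("=" if positive else "!=", env[x], env[y]))
        elif op == "rel":                     # ("rel", R, (z1, ..., zk), positive)
            _, r, args, positive = stmt
            head = r if positive else "not " + r
            assumptions.append((head,) + tuple(env[z] for z in args))
        else:
            raise ValueError("unknown statement: " + repr(stmt))
    return env, terms, assumptions


rho = [
    ("call", "z", "f", ("x", "y")),           # z := f(x, y)
    ("eq", "z", "x", True),                   # assume(z = x)
    ("copy", "y", "z"),                       # y := z
    ("rel", "R", ("x", "y"), False),          # assume(not R(x, y))
]
env, terms, assumptions = symbolic_run(rho, ["x", "y", "z"])
print(env["y"])
print(assumptions)
```

Note that assumptions are recorded over the terms held by the variables at the moment the assume statement is encountered, so the last assumption mentions f(x^, y^) rather than the variable y.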


4.2 Coherence 


Our definition of coherence modulo axioms is a smooth generalization of the def- 
inition of coherence in [26]. The notion of coherence consists of two properties — 
memoizing and early equality assumes. The memoizing property says, intuitively, 
when a term $t$ is computed after executing some prefix $\sigma$ of an execution, if $t$ is
equivalent to some other term modulo the assumptions made in the execution so
far, then $t$ must not have been dropped at the end of $\sigma$, i.e., a program variable
must already hold this term. We replace the notion of equivalence of terms in 
this definition by equivalence modulo the axioms as well. 

The notion of early assumes in [26] intuitively says that assumptions of equality
(on terms $t_1$ and $t_2$) should be encountered early, that is, earlier than dropping any
superterm of $t_1$ or $t_2$. This notion of early assumes allows for effectively computing
congruence closure on the set of terms computed by the execution, which in
turn, is necessary to accurately maintain which terms are equivalent. However, 
we observe that the notion in [26] is too restrictive and not entirely necessary. In 
our paper, we generalize this notion in several ways, to a more semantic one as 
follows. Whenever an execution encounters an assumption of equality between
two terms, we instead demand only that the equivalences that are additionally
implied by this new assumption can be inferred locally using the already known
congruence between terms in the window, i.e., the set of terms pointed to by the
program variables when the equality assumption is encountered. Next, we incorporate
axioms into this definition, by requiring that the notion of equivalence is
also modulo the axioms, and further require all assumptions (equality, disequality,
relational) to be early (as against only restricting equality
assumptions to be early like in [26]). We will elaborate on these differences using
an example after presenting the formal definition next.

Given a set of first order sentences $\Gamma$ and ground terms $t_1$ and $t_2$, we say
that $t_1 \cong_\Gamma t_2$ if $\Gamma \models t_1 = t_2$.


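Definition 2 (Coherence modulo axioms). Let $A$ be a set of axioms and
let $\rho$ be a complete or partial execution over variables $V$. Then, $\rho$ is said to be
coherent modulo $A$ if it satisfies the following two properties.


Memoizing. Let $\pi = \sigma \cdot \text{``}x := f(z)\text{''}$ be a prefix of $\rho$ and let $t = \mathrm{TEval}(\pi, x)$. If
there is a term $t' \in \mathrm{Terms}(\sigma)$ such that $t' \cong_{A \cup \kappa(\sigma)} t$, then there must exist
some variable $y \in V$ such that $\mathrm{TEval}(\sigma, y) \cong_{A \cup \kappa(\sigma)} t$.

Early Assumes. Let $\pi = \sigma \cdot \text{``}\mathbf{assume}(c)\text{''}$ be a prefix of $\rho$, where $c$ is any of
$x = y$, $x \neq y$, $R(z)$, or $\neg R(z)$. Let $t \in \mathrm{Terms}(\sigma)$ be a term computed in $\sigma$ such
that $t$ has been dropped, i.e., for every $x \in V$, we have $\mathrm{TEval}(\sigma, x) \not\cong_{A \cup \kappa(\sigma)} t$.
For any term $t' \in \mathrm{Terms}(\sigma)$, if $t' \cong_{A \cup \kappa(\pi)} t$, then $t' \cong_{A \cup \kappa(\sigma)} t$.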


Remark. We remark that every execution that is coherent as per the defi- 
nition in [26], is also coherent modulo A = Ø as in Definition 2. However, the 
converse is not true and we illustrate this difference below. 


Example 1. Let us now illustrate the notion of coherence in the presence of
axioms using the execution $\rho$ below.


$$\rho = z_1 := f(x, y) \cdot z_2 := f(y, x) \cdot z_3 := g(z_1) \cdot z_4 := g(z_2) \cdot z_3 := z_5 \cdot z_6 := g(z_1)$$


Let $\rho_i$ denote the prefix of $\rho$ of length $i$. Here, $\mathrm{TEval}(\rho_3, z_3) = g(f(\hat{x}, \hat{y}))$,
$\mathrm{TEval}(\rho_5, z_3) = \hat{z}_5 \neq g(f(\hat{x}, \hat{y}))$ and $\mathrm{TEval}(\rho_6, z_6) = g(f(\hat{x}, \hat{y}))$. When the set
of axioms is $A = \emptyset$, this execution is not coherent modulo $A$ as it violates the
memoizing requirement at the last statement $z_6 := g(z_1)$ (no variable stores the
term $g(f(\hat{x}, \hat{y}))$ after $\rho_5$).

Now, consider the axiom set denoting commutativity of $f$, i.e., $A_{comm} =
\{\forall u, v.\ f(u, v) = f(v, u)\}$. In this case, we observe that $f(\hat{x}, \hat{y}) \cong_{A_{comm}} f(\hat{y}, \hat{x})$
and thus $g(f(\hat{x}, \hat{y})) \cong_{A_{comm}} g(f(\hat{y}, \hat{x}))$. Also, $\mathrm{TEval}(\rho_5, z_4) = g(f(\hat{y}, \hat{x})) \cong_{A_{comm}}
g(f(\hat{x}, \hat{y}))$. This ensures that $\rho$ is indeed coherent modulo $A_{comm}$.
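
Example 1 can be checked with a small script. The sketch below covers only a restricted case: executions without assume statements, and axiom sets that are either empty or consist of commutativity axioms. Under these assumptions, equivalence of terms reduces to equality after sorting the arguments of commutative symbols, so no congruence closure is needed. The encoding and the names `normalize` and `memoizing_violation` are hypothetical.

```python
def normalize(term, commutative=frozenset()):
    """Canonical form of a term; arguments of commutative symbols are sorted."""
    if isinstance(term, str):
        return term
    f, *args = term
    args = [normalize(a, commutative) for a in args]
    if f in commutative:
        args.sort(key=repr)
    return (f, *args)


def memoizing_violation(execution, variables, commutative=frozenset()):
    """1-based index of the first statement violating memoizing, or None.

    Assumes the execution has no assume statements, so that equivalence
    modulo the axioms is equality of normalized terms.
    """
    env = {x: x + "^" for x in variables}          # "x^" is the initial constant of x
    seen = set(env.values())
    for i, (x, rhs) in enumerate(execution, start=1):
        if isinstance(rhs, str):                   # x := y
            env[x] = env[rhs]
            continue
        f, *args = rhs                             # x := f(z1, ..., zr)
        t = normalize((f, *[env[z] for z in args]), commutative)
        if t in seen and t not in env.values():
            return i                               # recomputed, but no variable holds it
        env[x] = t
        seen.add(t)
    return None


variables = ["x", "y", "z1", "z2", "z3", "z4", "z5", "z6"]
rho = [
    ("z1", ("f", "x", "y")),
    ("z2", ("f", "y", "x")),
    ("z3", ("g", "z1")),
    ("z4", ("g", "z2")),
    ("z3", "z5"),
    ("z6", ("g", "z1")),
]
print(memoizing_violation(rho, variables))                      # 6
print(memoizing_violation(rho, variables, commutative={"f"}))   # None
```

With the empty axiom set the last statement is flagged, while with f declared commutative the term held in z4 witnesses memoizing, matching the discussion in the example.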


Let $\mathrm{CoherentExecs}(\Sigma, V, A)$ denote the set of executions over the signature $\Sigma$
and variables $V$ that are coherent modulo the set of axioms $A$.


Definition 3. A program $s$ over signature $\Sigma$ and variables $V$ is said to be coherent
modulo $A$ if $\mathrm{Exec}(s) \subseteq \mathrm{CoherentExecs}(\Sigma, V, A)$.


In this paper, we explore several classes of axioms, studying when the verifi- 
cation problem for coherent programs modulo the axioms is decidable. 


5 Axioms over Relations


In this section, we investigate the decidability of the verification problem for
coherent programs modulo relational axioms, i.e., axioms which only involve
relation symbols $\mathcal{R}$ in the signature $\Sigma$.


5.1 Verification modulo EPR axioms 


A first-order formula is said to be an EPR formula [37] if it is of the form


$$\exists x_1, \ldots, x_k\ \forall y_1, \ldots, y_m\ \varphi$$


where $\varphi$ is quantifier-free and purely relational (uses no function symbols).

It is well known that satisfiability of EPR formulas is decidable, in fact by 
a reduction to Boolean satisfiability [24]. Consequently, the problem of checking 
whether a single execution is feasible under axioms written in EPR can be shown 
to be decidable, and has been exploited in bounded model-checking. 

Consequently, we could reasonably ask whether verification of coherent programs
under EPR axioms is decidable. Surprisingly, we show that it is not
(proof details can be found in [27]).


Theorem 1. Verification of uninterpreted coherent programs modulo EPR axioms
is undecidable.


Given the above result, we turn to several classes of quantified axioms, which 
are all expressible in EPR (and hence have a decidable bounded model checking 
problem) and examine their decidability for coherent program verification. 


5.2 Reflexivity, Irreflexivity, and Symmetry 


We consider program verification under the following axioms (individually):


$$\begin{aligned}
\varphi^R_{refl} &\triangleq \forall x.\ R(x, x) && \text{(reflexivity)} \\
\varphi^R_{irref} &\triangleq \forall x.\ \neg R(x, x) && \text{(irreflexivity)} \\
\varphi^R_{symm} &\triangleq \forall x, y.\ R(x, y) \Rightarrow R(y, x) && \text{(symmetry)}
\end{aligned} \tag{1}$$


We show that verification is decidable modulo these axioms using a technique
that we call program instrumentation. Let us fix a relation $R$ and an axiom $\varphi^R_p$,
where $p \in \{refl, irref, symm\}$. The idea is to find a function (in fact, a string
homomorphism) $h^R_p$ such that for any program $P$, $P$ is correct/coherent modulo
$\{\varphi^R_p\}$ iff $h^R_p(\mathrm{Exec}(P))$ is correct/coherent modulo the empty axiom set. Decidability
then follows by exploiting the results of [26]. The function $h^R_p$ will capture
the properties of the axiom it is trying to eliminate, and so it will be different
for different axioms. We first outline these functions $h^R_p$, then state their property
and prove the decidability result.

Fig. 2. Implied negative relational assumes for a transitive relation R. The dashed
edges (⇢) represent the inferred relationship implied from the relations marked by
bold edges (→).


For reflexivity, we transform an execution $\rho$ of $P$ to $\rho'$, where $\rho'$ is essentially
$\rho$, except that whenever we see the computation of a term, using an assignment
of the form “x := f(z)”, we immediately insert an assume statement that states
that $R(x, x)$ holds. More precisely, the homomorphism is defined as

$$h^R_{refl}(a) = \begin{cases} a \cdot \text{``}\mathbf{assume}(R(x, x))\text{''} & \text{if } a = \text{``}x := f(z)\text{''} \\ a & \text{otherwise} \end{cases}$$

The homomorphisms used for irreflexivity and symmetry follow similar lines and 
are outlined in [27]. 
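
The reflexivity instrumentation is a letter-by-letter rewriting of the execution, which is what makes it a string homomorphism. A minimal sketch follows, assuming statements are plain strings such as "x := f(z)"; the name `h_refl` and the regular expression used to recognize function assignments are illustrative choices, not taken from the paper.

```python
import re

# Matches assignments of the form "x := f(...)", but not copies "x := y"
# and not assume statements.
FUNCTION_ASSIGNMENT = re.compile(r"^(\w+) := \w+\(.*\)$")


def h_refl(execution, rel="R"):
    """Insert assume(R(x, x)) right after every statement x := f(z)."""
    out = []
    for a in execution:
        out.append(a)
        m = FUNCTION_ASSIGNMENT.match(a)
        if m:
            x = m.group(1)
            out.append("assume({0}({1}, {1}))".format(rel, x))
    return out


rho = ["x := f(y)", "z := x", "assume(x = z)"]
print(h_refl(rho))
# ['x := f(y)', 'assume(R(x, x))', 'z := x', 'assume(x = z)']
```

Because each letter is mapped independently of its context, applying `h_refl` to the concatenation of two executions gives the concatenation of their images.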


Theorem 2. For any relation symbol $R$ and $p \in \{refl, irref, symm\}$, the problems
of coherent verification modulo $\{\varphi^R_p\}$ and checking coherence modulo $\{\varphi^R_p\}$ are
PSPACE-complete.


5.3 Transitivity 


We now consider the transitivity axiom for a relation $R$, which says

$$\varphi^R_{trans} \triangleq \forall x, y, z.\ R(x, y) \wedge R(y, z) \Rightarrow R(x, z) \qquad \text{(transitivity)} \tag{2}$$


The proof of decidability modulo this axiom is different and more complex
than the proofs for reflexivity, irreflexivity, and symmetry. Intuitively, the program
instrumentation approach does not seem to work for transitivity. This is because
transitivity effects can be global. For example, we may have that the execution
asserts the sequence of relational assumes $R(t_1, t_2), R(t_2, t_3), \ldots, R(t_{n-1}, t_n)$
(here, $t_1, \ldots, t_n$ are terms computed by the execution), where some of the intermediate
terms may have been dropped by the program (i.e., the variables
holding these terms were reassigned). Consequently, relating $t_1$ and (the possibly
newly constructed term) $t_n$ requires fundamentally new machinery. We modify
the automaton construction from [26] so that it maintains the transitive closure
of the assumptions the program makes. Our main observation is the following:


Theorem 3. Let $\Sigma$ be a first order signature and $V$ a finite set of program
variables. Let $A = \{\varphi^R_{\mathrm{trans}} \mid R \in \mathcal{R}_{\mathrm{trans}}\}$ for some set of relation symbols
$\mathcal{R}_{\mathrm{trans}}$ in $\Sigma$. The following observations hold.

1. There is a finite automaton $\mathcal{F}_{\mathrm{trans}}$ (effectively constructible) of size
$O(2^{\mathrm{poly}(|V|)})$ such that for any execution $\rho$ that is coherent modulo $A$,
$\mathcal{F}_{\mathrm{trans}}$ accepts $\rho$ iff $\rho$ is feasible.

2. There is a finite automaton $\mathcal{C}_{\mathrm{trans}}$ (effectively constructible) of size
$O(2^{\mathrm{poly}(|V|)})$ such that $L(\mathcal{C}_{\mathrm{trans}}) = \mathsf{CoherentExecs}(\Sigma, V, A)$.


Proof Sketch. These are in some sense a generalization of the automata
constructions used to establish decidability in [26]. The automata $\mathcal{F}_{\mathrm{trans}}$ and
$\mathcal{C}_{\mathrm{trans}}$ rely on tracking equivalence between values stored in variables, and
functional and relational correspondences between these values. However, since
some relations may now be transitive, additional relational correspondences (or
their absence) may be implied for $R \in \mathcal{R}_{\mathrm{trans}}$. The basic idea is to maintain, for
transitive relations $R$, (a) the transitive closure of the positive relational
assumes assume($R(\cdot,\cdot)$), and (b) the negative relational assumes implied by the
relational assumes seen in an execution. More precisely, if the execution sees
assumes assume($R(x,y)$) and assume($R(y,z)$), then we also add the constraint
$R(x,z)$ to the automaton's state. Further, if the execution observes
assume($R(x,y)$) and assume($\neg R(x,z)$), then one can infer the constraint
$\neg R(y,z)$, and in this case, we accumulate this additional constraint in the state
of the automaton. Similarly, if the execution observes assume($R(y,z)$) and
assume($\neg R(x,z)$), then one can infer the constraint $\neg R(x,y)$, which is added
to the automaton's state. Both these scenarios are illustrated in Fig. 2. A
detailed proof is in [27].
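The three inference rules in the proof sketch can be illustrated by a small saturation procedure over facts about one transitive relation. This is only a sketch under simplifying assumptions: it works on a fixed set of variable names, ignores equalities, function applications and reassignment of variables, and the name `saturate` is hypothetical. It is not the automaton construction of [27].

```python
from itertools import product


def saturate(pos, neg):
    """Close facts about a single transitive relation R.

    pos: pairs (x, y) for which R(x, y) was assumed.
    neg: pairs (x, y) for which not R(x, y) was assumed.
    Returns (pos_closed, neg_closed, consistent).
    """
    pos, neg = set(pos), set(neg)
    changed = True
    while changed:
        changed = False
        # (a) transitive closure: R(x, y) and R(y, z) give R(x, z)
        for (x, y), (y2, z) in product(list(pos), repeat=2):
            if y == y2 and (x, z) not in pos:
                pos.add((x, z))
                changed = True
        # (b) implied negative facts from R(a, b) and not R(c, d)
        for (a, b), (c, d) in product(list(pos), list(neg)):
            # R(x, y) and not R(x, z) give not R(y, z)
            if a == c and (b, d) not in neg:
                neg.add((b, d))
                changed = True
            # R(y, z) and not R(x, z) give not R(x, y)
            if b == d and (c, a) not in neg:
                neg.add((c, a))
                changed = True
    # The facts are contradictory if some pair is both asserted and refuted.
    return pos, neg, pos.isdisjoint(neg)
```

For example, calling `saturate({("x", "y")}, {("x", "z")})` adds the negative fact for `("y", "z")`, which is the first scenario of Fig. 2.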


As a consequence we have the following result. 


Theorem 4. For $A = \{\varphi^R_{\mathrm{trans}} \mid R \in \mathcal{R}_{\mathrm{trans}}\}$, the problems of coherent verification
modulo $A$ and checking coherence modulo $A$ are PSPACE-complete.


5.4 Strict Partial Orders 


We now turn our attention to axioms that dictate that certain relations be
partial or total orders. The anti-symmetry axiom that holds for non-strict orders
introduces subtle complications. Recall that $R$ is anti-symmetric if
$\forall x, y \cdot R(x,y) \wedge R(y,x) \implies x = y$; this axiom can imply equality between terms
if $R$ holds between a pair of terms. Concretely, if $R$ is anti-symmetric, and the
program makes assumptions in an execution that $R(t_1,t_2)$ and $R(t_2,t_1)$ hold, then
any model in which such an execution is feasible must also ensure that $t_1 = t_2$.
This implicit equality assumption interferes with the notions of coherence and
the automata constructions (proofs of the results in [26] and Theorem 4) that
compute a congruence closure on terms in a streaming fashion.

Hence, we only consider strict partial orders in this section. Recall that a
relation $R$ is a strict partial order if it satisfies the irreflexivity axiom and the
transitivity axiom, together denoted $A^R_{\mathrm{spo}}$. We can prove decidability for
problems modulo $A^R_{\mathrm{spo}}$ by using our algorithm for irreflexivity and transitivity.


Theorem 5. The following problems are PSPACE-complete.
1. Given a program $P$ that is coherent modulo $A^R_{\mathrm{spo}}$, determine if $P$ is correct.
2. Given a program $P$, determine if $P$ is coherent modulo $A^R_{\mathrm{spo}}$.




5.5 Strict Total Orders 


A relation R is a strict total order if it is a strict partial order and satisfies:

$$\forall x, y \cdot x \neq y \implies R(x,y) \vee R(y,x) \qquad \text{(totality)} \qquad (3)$$


Strict total orders are again tricky to handle as the axiom for totality can
result in implicit equality between terms. For example, if $\neg R(x,y)$ and $\neg R(y,x)$
then it must be the case that $x = y$. However, if we restrict ourselves to
executions that only have assumes of the form assume($R(x,y)$) and do not have any
assumes on $\neg R$, i.e., of the form assume($\neg R(x,y)$), then there are no implicit
equalities that are entailed.

Unfortunately, in general, program executions can contain negative assumes
on $R$ (i.e., assumes of the form assume($\neg R(x,y)$)). In order to ensure that
executions contain only positive assumptions on $R$, we must be careful when
identifying executions of programs with conditionals: branches where the
assumption $\neg R(x,y)$ holds must be translated to a branch that assumes $R(y,x)$
and a branch that assumes $x = y$. We present a detailed translation in [27].
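The case split behind this translation can be illustrated at the level of a single execution: each negative assume on $R$ is replaced either by the reversed positive assume or by an equality assume. The sketch below is an assumption-laden illustration only. The statement encoding and the name `split_negative_assumes` are hypothetical, and the actual translation in [27] operates on programs with conditionals rather than on individual executions.

```python
from itertools import product


def split_negative_assumes(execution, R):
    """Return all executions obtained by replacing every
    ("assume_not", R, (x, y)) with either ("assume", R, (y, x))
    or ("assume_eq", x, y)."""
    choices = []
    for stmt in execution:
        if stmt[0] == "assume_not" and stmt[1] == R:
            x, y = stmt[2]
            # For a strict total order, not R(x, y) holds iff R(y, x) or x = y.
            choices.append([("assume", R, (y, x)), ("assume_eq", x, y)])
        else:
            choices.append([stmt])
    return [list(combo) for combo in product(*choices)]
```

An execution with k negative assumes on R yields 2^k executions, none of which contains a negative assume on R.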

After such a translation, executions can now have additional equality as- 
sumes even if they did not appear in the program. When we refer to coherent 
programs, we mean that they are coherent according to the above modified no- 
tion of executions. This means for such programs to be coherent, all executions 
must ensure that the additional equality assumes are early. And when we talk 
about coherent verification of programs with total orders, we mean verification 
for programs that are coherent after this transformation. 

We observe that in the absence of any assumes of the form $\neg R(x,y)$ the
verification problem modulo strict total orders reduces to that modulo strict
partial orders, giving us the following ($A^R_{\mathrm{sto}}$ denotes the axioms of irreflexivity,
transitivity and totality for the relation $R$).


Theorem 6. The problems of coherent verification and checking coherence
modulo $A^R_{\mathrm{sto}}$ are PSPACE-complete.


6 Axioms Over Functions 


We now discuss computational problems modulo axioms that involve function
symbols. The treatment of axioms involving functions in the verification of
coherent programs is inherently hard. This is because, as in the case of
(non-strict) partial orders and strict total orders, the axioms, along with the
assume steps in the execution, can imply equalities between terms beyond those
entailed by just the assume steps in the execution. For example, consider the
axiom $\forall x, y \cdot f(x,y) = f(y,x)$ constraining $f$ to be a commutative function. Then
terms like $f(f(x,y),z)$ are equal to terms like $f(z, f(x,y))$, and hence when
building models we must make sure that functions/relations on such terms are
defined in the same way. Terms made equivalent by the functional axioms can be
syntactically very different, and keeping track of the equivalence on unbounded
executions is hard using finite memory. We consider many natural classes of
axioms, and prove both positive and negative results that help delineate the
decidability/undecidability boundary.


6.1 Associativity 


We now consider the associativity axiom for a function f.

$$\varphi^f_{\mathrm{assoc}} \triangleq \forall x, y, z \cdot f(x, f(y,z)) = f(f(x,y), z) \qquad \text{(associativity)} \qquad (4)$$

We show, surprisingly to us, that coherent verification is undecidable modulo
$\{\varphi^f_{\mathrm{assoc}}\}$, i.e., even when we have only one axiom that requires only one
function to be associative. In fact, the situation is a lot worse: checking the
feasibility of even a single (even coherent) execution is undecidable in the
presence of a single associative function. The proof of the following result uses
a reduction from the word problem for finitely generated semigroups [36].


Theorem 7. Given a trace $\rho$ that is coherent modulo $\{\varphi^f_{\mathrm{assoc}}\}$, it is undecidable
to determine if $\rho$ is feasible. Therefore, the problem of checking whether a program
$P$ that is coherent modulo $\{\varphi^f_{\mathrm{assoc}}\}$ is correct is undecidable.


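6.2 Commutativity


We now consider the commutativity axiom, which is the following:

$$\varphi^f_{\mathrm{comm}} \triangleq \forall x, y \cdot f(x,y) = f(y,x) \qquad \text{(commutativity)} \qquad (5)$$

We augment executions with an auxiliary variable $v^* \notin V$ and transform
executions using the following homomorphism that uses the auxiliary variable $v^*$:

$$h^f_{\mathrm{comm}}(a) = \begin{cases} a \cdot \text{"}v^* := f(y,x)\text{"} \cdot \text{"assume}(z = v^*)\text{"} & \text{if } a = \text{"}z := f(x,y)\text{"} \\ a & \text{otherwise} \end{cases}$$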


We show that the above transformation preserves feasibility and coherence, 
giving us the following result. 
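The commutativity transformation can be sketched in the same style as a pass over an execution. The statement encoding, the spelling `"v*"` for the auxiliary variable and the name `h_comm` are assumptions for illustration. The sketch follows the displayed definition literally and assumes that the assigned variable z differs from both arguments x and y.

```python
AUX = "v*"  # auxiliary variable, assumed not to occur in the program


def h_comm(execution, f):
    """After each z := f(x, y), append v* := f(y, x) and assume(z = v*)."""
    out = []
    for stmt in execution:
        out.append(stmt)
        if stmt[0] == "assign" and stmt[2] == f and len(stmt[3]) == 2:
            z, (x, y) = stmt[1], stmt[3]
            out.append(("assign", AUX, f, (y, x)))
            out.append(("assume_eq", z, AUX))
    return out


rho = [("assign", "z", "f", ("x", "y")), ("assign", "w", "g", ("z",))]
instrumented = h_comm(rho, "f")
```

Assignments that use a different function symbol, such as the second statement above, are left untouched.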


Theorem 8. Verification of coherent programs and checking coherence modulo
commutativity axioms are decidable and PSPACE-complete.


6.3 Idempotence


Next we consider the idempotence axiom for a unary function f:

$$\varphi^f_{\mathrm{idem}} \triangleq \forall x \cdot f(x) = f(f(x)) \qquad \text{(idempotence)} \qquad (6)$$

Again, we show that there is a simple homomorphism $h^f_{\mathrm{idem}}$ that preserves
coherence and feasibility (see [27]) and reduces verification to one without axioms.


Theorem 9. Verification of coherent programs and checking coherence modulo 
idempotence axioms is PSPACE-complete. 
7 Combining Axioms


We have thus far proved decidability results when a relation or function satisfies
certain properties like reflexivity/irreflexivity/symmetry/transitivity or
commutativity/idempotence. We now show that all of these results can be combined.
That is, we can consider a signature where relations and functions are assumed
to satisfy some subset of these properties, with some being uninterpreted,
and the verification problem will remain decidable for coherent programs.


Theorem 10. Let A be a set of axioms where each relation symbol R is ei- 
ther a total order or satisfies some (possibly empty) subset of properties out of 
reflexivity, irreflexivity, symmetry, transitivity, and each function symbol f sat- 
isfies some (possibly empty) subset out of commutativity and idempotence. The 
verification problem for coherent programs modulo A is PSPACE-complete. 


The proof of the above result proceeds by eliminating axioms one at a time. 
We first eliminate the relational axioms (reflexivity, irreflexivity, symmetry) in A 
using program instrumentation. We then eliminate the functional axioms in A, 
again using program instrumentation. Our proof relies on this order of elimina- 
tion of axioms. At this point, the only axioms remaining are those corresponding 
to transitivity of a subset of relational symbols, which is handled using the au- 
tomata construction discussed in the proof of Theorem 3. 


8 Related Work 


The theory of equality with uninterpreted functions (EUF) is a widely used theory
in many verification applications as it has a decidable quantifier-free fragment.
EUF has been central to advances in verification of microprocessor control [6,4] 
and hardware verification [1,19] and property directed model checking [18]. EUF 
has been used as a popular abstraction in software verification [2,3]. Uninter- 
preted functions have also been studied for equivalence checking and translation 
validation [35]. Bueno et al [5] demonstrated the effectiveness of uninterpreted 
programs for verifying SVCOMP benchmarks against control flow properties. 
Mathur et al [26] introduced the class of coherent uninterpreted programs 
and showed that verification of coherent programs, with or without recursive 
function calls, is a decidable problem. This is one of the few subclasses of pro- 
gram verification over infinite domains that is known to be decidable. Previous 
works [13,14,31] have established decidability of verification of classes of uninter- 
preted programs with heavy syntactic restrictions such as disallowing condition- 
als inside loops or nested loops, etc. As noted in [26], the notion of coherence is 
close to the notion of a bounded pathwidth decomposition [38]. A term that is 
created in a coherent execution stays within some program variable (modulo con- 
gruence) until the first time all variables containing that term are over-written, 
and after this point, the execution never computes it again, and thus, the set of 
windows that contain a term form a contiguous segment of the program execu- 
tion. Path decomposition and the related notion of tree decomposition have been 
exploited many times in the literature to give decidability in verification [25,7,8]. 
The work in [28] extends the work of [26] to updatable maps and identifies
extensions of coherence that make verification decidable. It utilizes this to
provide an implementation of verification algorithms for memory safety for a class
of heap-manipulating programs, including traversal algorithms on data structures
such as singly linked lists, sorted lists, binary search trees, etc. Combining the
results of this paper with these results is an interesting future direction.

The class of EPR formulas that consist of universally quantified formulas 
over relational signatures is a well-known decidable class of first-order logic [37]. 
EPR-based reasoning has proved powerful for verification of large-scale
systems [33,29,39], and the Ivy [34,30] system is one of the most notable frameworks
that exploit EPR-based reasoning for verifying program snippets without
recursion. EPR encoding of order axioms such as reflexivity, symmetry, transitivity
and total orders has been used in proving programs working over heaps [20].

The work in Kleene Algebra with Tests (KAT) [22] considers problems in- 
volving unbounded recursion and choice with abstractions of data, similar to our 
work. However, while we treat congruence axioms for equality faithfully in our 
work, it is unclear to us how to express these in KAT or its extensions [21,23,9]. 
Furthermore, the restrictions of coherence studied in [26] and the work here that 
are based on bounded path-width notions seem very different from studies of 
decidable problems in KAT. A study of whether our results can be adapted to 
yield decidable fragments for KAT is an interesting future direction. 

A notable verification technique with an automata-theoretic foundation and 
that has been very effective in practice is that of trace abstraction due to Heiz- 
mann et al [15,16,17,10,11,12]. In this technique, one constructs iteratively regu- 
lar sets that (incompletely) capture the set of all infeasible executions, eventually 
striving to cover all failing executions of a program, but handling complex
theories such as arithmetic. In contrast, our work builds, in one stroke, complete
automata that accept all infeasible traces over a vocabulary; it handles only
simple theories with restricted sets of axioms, but yields decidability. Combining
these lines of work for efficient software verification is an interesting future direction.


9 Conclusions 


By incorporating axioms on functions and relations, the decidability results in this
paper enable a more faithful automatic verification of programs. It is worth
noting that the upper bound for all our decidability results is PSPACE, which 
is the same as that for Boolean programs. Thus, though we consider programs 
over infinite domains with additional structure, our verification results have the 
same complexity as that for programs over Boolean domains. 

One future direction is to adapt this technique for practical program veri- 
fication. In this context, adapting our technique within the automata-theoretic 
technique of [15,17,16,12,10] seems most promising. Second, there are several 
program verification techniques that use EPR, and in several of these, EPR 
is used mainly to establish a linear order on the universe [20]. Automatically 
verifying such programs using our technique is worth exploring. 
Abstract. We present a formalized proof of the regularity of the infinity
predicate on ground terms. This predicate plays an important role in the
first-order theory of rewriting because it allows one to express the termination
property. The paper also contains a formalized proof of a direct tree
automaton construction of the normal form predicate, due to Comon.
1 Introduction


Term rewriting [1,18] is an abstract model of computation which underlies much 
of declarative programming and automated theorem proving. The foundation of 
rewriting is equational logic. Equations are used from left to right to direct the 
search for proofs. Fundamental properties like confluence (which ensures that 
different computation paths produce the same result) and termination (all com- 
putation paths produce a result) are undecidable in general. For terminating 
systems, one is interested in estimating the resources needed to evaluate expres- 
sions (space and time complexity). Much progress has been made in establishing 
sufficient and automatable criteria for confluence, termination, complexity, and 
other properties of rewrite systems. These criteria have been implemented in 
highly optimized automatic tools that compete on a yearly basis [12,13]. These 
competitions, together with the recent advances in SAT [4] and SMT [2] solving, 
have on the one hand led to specialized techniques that are especially suitable for 
automation. On the other hand, software bugs observed in the tools gave rise to 
the more recent activity of certification of the output of termination, complexity, 
and confluence tools. This is done by formalizing the underlying methods in an 
interactive proof assistant like Coq [3] or Isabelle [15], and using the code gen- 
eration facilities of these proof assistants to obtain trustworthy programs that 
can certify the output of the tools. 

In this paper we are concerned with the formalization of methods that are 
used in FORT [16,17], a tool that implements the first-order theory of rewriting 
for the decidable class of left-linear, right-ground rewrite systems. FORT can be
used to decide properties of a given rewrite system and to synthesize rewrite 
systems that satisfy arbitrary properties expressible in the first-order theory of 
rewriting. The decision procedure is based on tree automata techniques and goes 
back to a paper by Dauchet and Tison [7]. In a recent paper [10] the authors 
formalized results concerning ground tree transducers and $\mathrm{RR}_2$ automata for a
fragment of the first-order theory that allows one to express confluence, resulting in
a formalized confluence prover for left-linear, right-ground rewrite systems. In 
this paper we cover the infinity predicate that is crucial for expressing the termi- 
nation property in the first-order theory of rewriting and an efficient automaton 
construction of the normal form predicate that is employed in FORT. The former 
goes back to a technical report by Dauchet and Tison [8] and the latter is based 
on a paper by Comon [5]. The normal form predicate has other applications as 
well (e.g. [9,14]). A proof of the construction of [8] is given in [16], but this proof 
contains a serious mistake that we report at the end of Section 3. 


Our formalizations are based on IsaFoR [19], an Isabelle/HOL library con- 
taining numerous abstract results and concrete techniques from the rewriting 
literature. Our own development can be found at 


http://cl-informatik.uibk.ac.at/software/fortissimo/tacas2020/


Most definitions, theorems, and lemmata in this paper directly correspond to
the formalization. These are indicated by the ✓ symbol, which links to an HTML
presentation in the PDF version of the paper.


In the next section we recall basic definitions, notation, and results concerning 
term rewriting and tree automata that we need in the sequel. In Section 3 we 
present our first main result, a formalized correctness proof of the regularity 
of the infinity predicate for regular relations. The tree automaton constructed 
in the correctness proof is not directly executable due to the definition of $Q_\infty$,
which plays an important role in the construction of the tree automaton. In
Section 4 we present our second main result, an equivalent definition of $Q_\infty$ that
is constructive. Our third result, a formalized correctness proof of an efficient
tree automata construction of the normal form predicate for left-linear rewrite 
systems, is the topic of Section 5. We conclude in Section 6 with some statistics 
of our formalizations as well as a list of tasks that remain to be done for a 
certified version of FORT. 


When we write “formalized” we always mean “formalized in Isabelle/HOL.”


2 Preliminaries 


Familiarity with term rewriting [1] and tree automata [6] is useful, but we briefly 
recall important definitions and notation that we use in the remainder. 


We assume a given signature $\mathcal{F}$ and a set of variables $\mathcal{V}$. Function symbols
in $\mathcal{F}$ are equipped with a fixed arity. Function symbols of arity zero are called
constants. The set of terms built from $\mathcal{F}$ and $\mathcal{V}$ is denoted by $\mathcal{T}(\mathcal{F},\mathcal{V})$ and
inductively defined: A term is either a variable $x \in \mathcal{V}$ or $f(t_1,\ldots,t_n)$ for a
function symbol $f$ of arity $n$ and terms $t_1,\ldots,t_n \in \mathcal{T}(\mathcal{F},\mathcal{V})$. The set of variables
occurring in a term $t$ is denoted by $\mathrm{Var}(t)$. A term $t$ with $\mathrm{Var}(t) = \emptyset$ is called
ground. We write $\mathcal{T}(\mathcal{F})$ for the set of ground terms. Positions are strings of
positive integers which are used to address subterms. The empty string is called
the root position and denoted by $\epsilon$. The set of positions in a term $t$ is denoted by
$\mathrm{Pos}(t)$ and the subterm of $t$ at position $p \in \mathrm{Pos}(t)$ by $t|_p$. We write $s \lhd t$ if $s$
is a proper subterm of $t$, i.e., $s = t|_p$ with $p \neq \epsilon$. We write $t[u]_p$ for the result
of replacing the subterm of $t$ at position $p$ with the term $u$. The root symbol of
a term $t$ is denoted by $\mathrm{root}(t)$ and $t(p)$ denotes $\mathrm{root}(t|_p)$. We write $p < q$ if $p$
is a proper prefix of $q$. A context $C$ is a term with a hole $\square$. Here $\square \notin \mathcal{F}$ is a
special constant. We write $C[t]$ for the result of replacing the hole in $C$ by $t$. A
substitution $\sigma$ is a mapping from variables to terms. We write $t\sigma$ for the result
of applying $\sigma$ to the term $t$.
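These notions map directly onto a small data structure. The following sketch assumes one particular encoding, which is not part of the paper or of the Isabelle/HOL formalization: a term is a Python tuple whose first component is the function symbol, a variable is a plain string, and a position is a tuple of 1-based child indices with `()` as the root position.

```python
def children(t):
    # Variables are plain strings (assumed encoding); they have no children.
    return () if isinstance(t, str) else t[1:]


def positions(t):
    """All positions of t, starting with the root position ()."""
    yield ()
    for i, c in enumerate(children(t), start=1):
        for p in positions(c):
            yield (i,) + p


def subterm(t, p):
    """The subterm t|_p."""
    for i in p:
        t = t[i]  # index 0 holds the symbol, so child i sits at index i
    return t


def replace(t, p, u):
    """The term t[u]_p."""
    if not p:
        return u
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], u),) + t[i + 1:]


def variables(t):
    """The set Var(t)."""
    if isinstance(t, str):
        return {t}
    return set().union(*(variables(c) for c in t[1:]))
```

With `t = ("f", ("g", "x"), ("a",))`, that is f(g(x), a), the call `subterm(t, (1, 1))` returns the variable `"x"`, and replacing that position by `("a",)` gives a ground term.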


A term rewrite system (TRS for short) $\mathcal{R}$ consists of rewrite rules $\ell \to r$
between terms $\ell$ and $r$ over the same signature $\mathcal{F}$ such that $\mathrm{Var}(r) \subseteq \mathrm{Var}(\ell)$.
The rewrite relation $\to_\mathcal{R}$ is defined on terms as follows: $s \to_\mathcal{R} t$ if there exist
a position $p \in \mathrm{Pos}(s)$, a rewrite rule $\ell \to r \in \mathcal{R}$, and a substitution $\sigma$ such
that $s|_p = \ell\sigma$ and $t = s[r\sigma]_p$. The reflexive transitive closure of $\to_\mathcal{R}$ is denoted
by $\to^*_\mathcal{R}$. A redex is a substitution instance of a left-hand side of a rewrite rule.
Terms that contain a redex as subterm are called reducible. A normal form is a
term without redexes. We write $\mathrm{NF}(\mathcal{R})$ for the set of ground normal forms of $\mathcal{R}$.
In this paper we consider finite TRSs over finite signatures. The TRSs handled
by FORT are left-linear (no duplicate variables in left-hand sides of rewrite rules)
and right-ground (no variables in right-hand sides of rewrite rules).
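A single rewrite step can be sketched by combining matching with replacement at a position. The encoding (terms as tuples, variables as plain strings) and all function names are assumptions made for this illustration; the sketch is unrelated to the implementation in FORT or IsaFoR.

```python
def is_var(t):
    return isinstance(t, str)  # variables are plain strings (assumption)


def match(pattern, t, subst=None):
    """Return a substitution sigma with pattern*sigma == t, or None."""
    subst = {} if subst is None else subst
    if is_var(pattern):
        if pattern in subst and subst[pattern] != t:
            return None
        subst[pattern] = t
        return subst
    if is_var(t) or pattern[0] != t[0] or len(pattern) != len(t):
        return None
    for p_i, t_i in zip(pattern[1:], t[1:]):
        if match(p_i, t_i, subst) is None:
            return None
    return subst


def apply(subst, t):
    """Apply a substitution to a term."""
    if is_var(t):
        return subst[t]
    return (t[0],) + tuple(apply(subst, c) for c in t[1:])


def rewrite_steps(s, rules):
    """Yield every term t with s -> t in one step."""
    for lhs, rhs in rules:
        sigma = match(lhs, s)
        if sigma is not None:  # s itself is a redex
            yield apply(sigma, rhs)
    if not is_var(s):
        for i in range(1, len(s)):  # rewrite inside the i-th argument
            for c in rewrite_steps(s[i], rules):
                yield s[:i] + (c,) + s[i + 1:]


def is_normal_form(s, rules):
    return next(rewrite_steps(s, rules), None) is None
```

For the left-linear, right-ground rule f(x, a) → b, encoded as `(("f", "x", ("a",)), ("b",))`, the term g(f(c, a)) has exactly one successor, g(b), which is a normal form.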


We now recall some basic notions related to tree automata. A tree automaton
is a quadruple $\mathcal{A} = (\mathcal{F}, Q, Q_f, \Delta)$ consisting of a finite signature $\mathcal{F}$, a finite set $Q$
of states, disjoint from $\mathcal{F}$, a subset $Q_f \subseteq Q$ of final states, and a set of transition
rules $\Delta$. Every transition rule has one of the following two shapes:


- $f(p_1,\ldots,p_n) \to q$ with $f \in \mathcal{F}$ and $p_1,\ldots,p_n,q \in Q$, or
- $p \to q$ with $p,q \in Q$.


Transition rules of the second shape are called epsilon transitions. We write $\Delta_\varepsilon$
for the set of epsilon transitions. Furthermore, $\Delta_{\not\varepsilon} = \Delta \setminus \Delta_\varepsilon$. Transition rules
can be viewed as rewrite rules between ground terms in $\mathcal{T}(\mathcal{F} \cup Q)$. The induced
rewrite relation is denoted by $\to_\Delta$ or $\to_\mathcal{A}$. A ground term $t \in \mathcal{T}(\mathcal{F})$ is accepted
by $\mathcal{A}$ if $t \to^*_\mathcal{A} q$ for some $q \in Q_f$. The set of all accepted terms is denoted by $L(\mathcal{A})$
and a set $L$ of ground terms is regular if $L = L(\mathcal{A})$ for some tree automaton $\mathcal{A}$.
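Acceptance can be computed bottom-up by collecting, for each subterm, the set of states it can reach. The sketch below is a minimal illustration under an assumed encoding: ground terms are tuples, non-epsilon rules are stored in a dictionary from `(symbol, argument states)` to a set of target states, and epsilon rules are pairs `(p, q)`. The names `run`, `accepts` and `eps_closure` are hypothetical.

```python
from itertools import product


def eps_closure(states, eps):
    """All states reachable from `states` using epsilon rules (p, q)."""
    todo, seen = list(states), set(states)
    while todo:
        p = todo.pop()
        for p2, q in eps:
            if p2 == p and q not in seen:
                seen.add(q)
                todo.append(q)
    return seen


def run(t, rules, eps=()):
    """The set of states q such that the ground term t rewrites to q."""
    child_states = [run(c, rules, eps) for c in t[1:]]
    reached = set()
    for ps in product(*child_states):  # for constants this is the empty tuple
        reached |= rules.get((t[0], ps), set())
    return eps_closure(reached, eps)


def accepts(t, rules, final, eps=()):
    return bool(run(t, rules, eps) & set(final))


# Example: terms over {a, g} with an even number of occurrences of g.
RULES = {
    ("a", ()): {"even"},
    ("g", ("even",)): {"odd"},
    ("g", ("odd",)): {"even"},
}
```

With final state `"even"`, the term g(g(a)) is accepted and g(a) is rejected; adding the epsilon rule `("odd", "even")` makes every term accepted.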


Let $\mathcal{A} = (\mathcal{F}, Q, Q_f, \Delta)$ be a tree automaton. A state $q \in Q$ is reachable if
$t \to^*_\mathcal{A} q$ for some term $t \in \mathcal{T}(\mathcal{F})$. We say that $q$ is productive if $C[q] \to^*_\mathcal{A} q_f$
for some ground context $C$ and final state $q_f \in Q_f$. The automaton $\mathcal{A}$ is trim
if all states are both reachable and productive. Any tree automaton can be
transformed into an equivalent trim automaton. This result has been formalized
in IsaFoR by Felgenhauer and Thiemann [11].
Below we present a formalized proof of a version of the pumping lemma that
we need later.


Lemma 1. Let $\mathcal{A} = (\mathcal{F}, Q, Q_f, \Delta)$ be a tree automaton and $t \to^*_\mathcal{A} q$ with
$t \in \mathcal{T}(\mathcal{F})$ and $q \in Q$. If $\mathrm{height}(t) > |Q|$ then there exist contexts $C_1$ and
$C_2 \neq \square$, a term $u$, and a state $p$ such that $t = C_1[C_2[u]]$, $u \to^*_\mathcal{A} p$,
$C_2[p] \to^*_\mathcal{A} p$, and $C_1[p] \to^*_\mathcal{A} q$.


Proof. From the assumptions $t \to^*_\mathcal{A} q$ and $\mathrm{height}(t) > |Q|$ we obtain a sequence
$(t_1, \ldots, t_{n+1}, q_1, \ldots, q_{n+1}, D_1, \ldots, D_n)$ consisting of ground terms, states, and
non-empty contexts with $n > |Q|$ such that


- $t_i \to^*_\mathcal{A} q_i$ for all $i \leq n + 1$,
- $D_i[t_i] = t_{i+1}$ and $D_i[q_i] \to^*_\mathcal{A} q_{i+1}$ for all $i \leq n$, and
- $q_{n+1} = q$ and $t_{n+1} = t$


by a straightforward induction proof on $t$. Because $n > |Q|$ there exist indices
$1 \leq i < j \leq n$ such that $q_i = q_j$. We construct the contexts
$C_1 = D_n[\ldots[D_j]\ldots]$ and $C_2 = D_{j-1}[\ldots[D_i]\ldots]$. Note that $C_2 \neq \square$ as $i < j$.
We obtain $C_2[q_i] \to^*_\mathcal{A} q_j$ and $C_1[q_j] \to^*_\mathcal{A} q_{n+1}$ by induction on the difference
$j - i$. By letting $p = q_i = q_j$ and $u = t_i$ we obtain the desired result. ✓


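We conclude this preliminary section with a brief account of $\mathrm{RR}_2$ relations,
which are binary relations on ground terms over a signature $\mathcal{F}$ whose encoding
as sets of ground terms over the extended signature $\mathcal{F}^{(2)} = (\mathcal{F} \cup \{\bot\})^2$ with a
fresh constant $\bot \notin \mathcal{F}$ is regular. The arity of a symbol $fg \in \mathcal{F}^{(2)}$ is the maximum
of the arities of $f$ and $g$. The encoding of two terms $t, u \in \mathcal{T}(\mathcal{F})$ is the unique
term $\langle t,u \rangle \in \mathcal{T}(\mathcal{F}^{(2)})$ such that $\mathrm{Pos}(\langle t,u \rangle) = \mathrm{Pos}(t) \cup \mathrm{Pos}(u)$ and
$\langle t,u \rangle(p) = fg$ where

$$f = \begin{cases} t(p) & \text{if } p \in \mathrm{Pos}(t) \\ \bot & \text{otherwise} \end{cases} \qquad g = \begin{cases} u(p) & \text{if } p \in \mathrm{Pos}(u) \\ \bot & \text{otherwise} \end{cases}$$

for all positions $p \in \mathrm{Pos}(t) \cup \mathrm{Pos}(u)$. We illustrate this on a concrete example.
For the ground terms $t = f(g(a), f(b,a))$ and $u = f(a, g(g(b)))$ we obtain
$\langle t,u \rangle = ff(ga(a\bot), fg(bg(\bot b), a\bot))$. A tree automaton operating on terms in
$\mathcal{T}(\mathcal{F}^{(2)})$ is called an $\mathrm{RR}_2$ automaton. The two projection operations effectively
transform $\mathrm{RR}_2$ relations on $\mathcal{T}(\mathcal{F})$ to regular subsets of $\mathcal{T}(\mathcal{F})$.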
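The encoding of a pair of ground terms can be computed by a simultaneous recursive descent. In the sketch below, the tuple encoding of terms, the ASCII spelling `"_"` for the fresh constant ⊥, and the representation of a paired symbol fg as the Python pair `(f, g)` are all assumptions made for illustration.

```python
BOT = "_"  # stands for the fresh constant ⊥ (assumed ASCII spelling)


def encode(t, u):
    """The encoding <t, u>; pass None for a side that has no subterm here."""
    f = t[0] if t is not None else BOT
    g = u[0] if u is not None else BOT
    ts = t[1:] if t is not None else ()
    us = u[1:] if u is not None else ()
    n = max(len(ts), len(us))  # arity of fg is the maximum of both arities
    kids = tuple(
        encode(ts[i] if i < len(ts) else None,
               us[i] if i < len(us) else None)
        for i in range(n)
    )
    return ((f, g),) + kids


# The example from the text: t = f(g(a), f(b, a)) and u = f(a, g(g(b))).
t = ("f", ("g", ("a",)), ("f", ("b",), ("a",)))
u = ("f", ("a",), ("g", ("g", ("b",))))
encoded = encode(t, u)
```

Reading each pair `(f, g)` as the symbol fg, `encoded` corresponds to ff(ga(a⊥), fg(bg(⊥b), a⊥)).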


3 Infinity Predicate 


The following formula in the first-order theory of rewriting expresses the
termination property:²

$$\forall t\; \mathrm{FIN}_{\to^+}(t) \;\wedge\; \neg\exists u\,(u \to^+ u)$$

² The formula characterizes termination of all rewrite systems $\mathcal{R}$ with the property
that the induced rewrite relation $\to_\mathcal{R}$ is finitely branching.

The predicate $\mathrm{FIN}_{\to^+}$ holds for $t \in \mathcal{T}(\mathcal{F})$ if there are only finitely many terms
$u \in \mathcal{T}(\mathcal{F})$ such that $t \to^+ u$. We consider its complement as it leads to smaller
automata:

$$\neg\exists t\,(\mathrm{INF}_{\to^+}(t) \;\vee\; t \to^+ t)$$

with $\mathrm{INF}_{\to^+} = \{t \in \mathcal{T}(\mathcal{F}) \mid t \to^+ u \text{ for infinitely many terms } u \in \mathcal{T}(\mathcal{F})\}$.


Definition 1. Let $\circ$ be an arbitrary binary relation on $\mathcal{T}(\mathcal{F})$. We write $\mathrm{INF}_\circ$
for the set $\{t \in \mathcal{T}(\mathcal{F}) \mid (t,u) \in \circ \text{ for infinitely many terms } u \in \mathcal{T}(\mathcal{F})\}$.

In [8] the construction of a tree automaton that accepts $\mathrm{FIN}_\circ$ for an arbitrary
$\mathrm{RR}_2$ relation $\circ$³ is given. In [16, Appendix A] a correctness proof of the
construction is presented, which contains a serious mistake (reported at the end
of this section). In this section we give a rigorous and formalized proof of the
regularity of $\mathrm{INF}_\circ$ for arbitrary $\mathrm{RR}_2$ relations $\circ$.


Theorem 1. The set $\mathrm{INF}_\circ$ is regular for every $\mathrm{RR}_2$ relation $\circ \subseteq \mathcal{T}(\mathcal{F}) \times \mathcal{T}(\mathcal{F})$.


The following definition originates from [8].


Definition 2. Given a tree automaton $\mathcal{A} = (\mathcal{F}^{(2)}, Q, Q_f, \Delta)$, the set $Q_\infty \subseteq Q$
consists of all states $q \in Q$ such that $\langle \bot, t \rangle \to^*_\mathcal{A} q$ for infinitely many terms
$t \in \mathcal{T}(\mathcal{F})$.


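Example 1. Consider the binary relation

$$\circ = \{(f(a, g^n(b)),\; g^m(f(a,b))) \mid n = 2 \text{ and } m \geq 1, \text{ or } n \geq 3 \text{ and } m = 1\}$$

The encoding of $\circ$ is accepted by the $\mathrm{RR}_2$ automaton $\mathcal{A} = (\mathcal{F}^{(2)}, Q, Q_f, \Delta)$ with
$\mathcal{F} = \{a, b, f, g\}$, $Q = \{0, \ldots, 11\}$, $Q_f = \{0\}$, and $\Delta$ consisting of the following
transition rules:

$$\begin{array}{llll}
fg(1,2) \to 0 & \bot f(3,4) \to 5 & g\bot(6) \to 2 & b\bot \to 7 \\
fg(8,9) \to 0 & \bot g(5) \to 5 & g\bot(7) \to 6 & b\bot \to 11 \\
af(3,4) \to 1 & \bot a \to 3 & g\bot(10) \to 9 & ag(5) \to 1 \\
af(3,4) \to 8 & \bot b \to 4 & g\bot(11) \to 10 & g\bot(11) \to 11
\end{array}$$

For instance,

$$\langle f(a, g(g(b))), g(f(a,b)) \rangle = fg(af(\bot a, \bot b), g\bot(g\bot(b\bot))) \to^*_\mathcal{A} fg(af(3,4), g\bot(g\bot(7))) \to^*_\mathcal{A} fg(1, g\bot(6)) \to_\mathcal{A} fg(1,2) \to_\mathcal{A} 0$$

but $\langle f(a, g(b)), f(a,b) \rangle = ff(aa, gb(b\bot))$ is not accepted.
We have $Q_\infty = \{5\}$. State 5 is reached by $\langle \bot, g^n(f(a,b)) \rangle$ for all $n \geq 0$.

³ The relation $\to^+_\mathcal{R}$ is an $\mathrm{RR}_2$ relation for left-linear, right-ground TRSs $\mathcal{R}$. Other uses
of FIN (INF) can be found in [16].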
Definition 3. ✓ Given a tree automaton $\mathcal{A} = (\mathcal{F}^{(2)}, Q, Q_f, \Delta)$, we define the
tree automaton $\mathcal{A}_\infty = (\mathcal{F}^{(2)}, Q \cup \bar{Q}, \bar{Q}_f, \Delta \cup \bar{\Delta})$. Here $\bar{Q}$ is a copy of $Q$ where
every state is dashed: $\bar{q} \in \bar{Q}$ if and only if $q \in Q$. For every transition rule
$fg(q_1, \ldots, q_n) \to q \in \Delta$ we have the following transition rules in $\bar{\Delta}$:

$$fg(q_1, \ldots, q_n) \to \bar{q} \quad \text{if } q \in Q_\infty \text{ and } f = \bot \qquad (1)$$
$$fg(q_1, \ldots, q_{i-1}, \bar{q}_i, q_{i+1}, \ldots, q_n) \to \bar{q} \quad \text{for all } 1 \leq i \leq n \qquad (2)$$

Moreover, for every $\varepsilon$-transition $p \to q \in \Delta$ we add

$$\bar{p} \to \bar{q} \qquad (3)$$

to $\bar{\Delta}$. We write $\Delta'$ for $\Delta \cup \bar{\Delta}$.


Dashed states are created by rules of shape (1) and propagated by rules of
shapes (2) and (3). The above construction differs from the one in [8]; instead
of (1) the latter contains $fg(q_1, \ldots, q_n) \to \bar{q}$ if $q_i \in Q_\infty$ for some $i > \mathrm{arity}(f)$. In
an implementation, rather than adding all dashed states and all transition rules
of shape (2), the necessary rules would be computed by propagating the dashes
created by (1) in order to avoid the appearance of unreachable dashed states.
When $\mathcal{A}_\infty$ is used in isolation, a single bit suffices to record that a dashed state
occurred during a computation.


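Example 2. For the tree automaton $\mathcal{A}$ from Example 1 we obtain $\mathcal{A}_\infty$ by adding
the following transition rules (the missing rules of shape (2) involve unreachable
states):

$$\bot f(3,4) \to \bar{5} \qquad \bot g(5) \to \bar{5} \qquad \bot g(\bar{5}) \to \bar{5} \qquad ag(\bar{5}) \to \bar{1} \qquad fg(\bar{1},2) \to \bar{0}$$

The unique final state of $\mathcal{A}_\infty$ is $\bar{0}$. We have $\langle f(a, g(g(b))), g(g(f(a,b))) \rangle \in L(\mathcal{A}_\infty)$
but there is no term $u$ such that $\langle f(a, g(b)), u \rangle \in L(\mathcal{A}_\infty)$.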
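The construction of the dashed rules can be replayed mechanically on this example. The sketch below rests on several assumptions: the rule set `DELTA` is the transition table of Example 1 as read here, symbols are two-character strings with `"_"` standing for ⊥, rules are triples `(symbol, argument states, target)`, and the helper names are hypothetical. It generates all rules of shapes (1) and (2), including those for unreachable dashed states, and does not attempt to compute $Q_\infty$, which is supplied by hand.

```python
BOT = "_"  # ASCII stand-in for the fresh constant ⊥ (an assumption)


def dash(q):
    """Dashed copy of state q."""
    return (q, "dashed")


def dashed_rules(rules, q_inf):
    """Rules of shapes (1) and (2) for rules given as (symbol, args, target)."""
    new = set()
    for fg, args, q in rules:
        if fg[0] == BOT and q in q_inf:            # shape (1)
            new.add((fg, args, dash(q)))
        for i in range(len(args)):                 # shape (2)
            marked = args[:i] + (dash(args[i]),) + args[i + 1:]
            new.add((fg, marked, dash(q)))
    return new


def run(t, rules):
    """States reachable from the ground term t (no epsilon rules here)."""
    kids = [run(c, rules) for c in t[1:]]
    return {q for fg, args, q in rules
            if fg == t[0] and len(args) == len(kids)
            and all(a in s for a, s in zip(args, kids))}


# Transition rules as read from Example 1; "_f" stands for ⊥f, "g_" for g⊥, etc.
DELTA = {
    ("fg", (1, 2), 0), ("fg", (8, 9), 0), ("af", (3, 4), 1), ("af", (3, 4), 8),
    ("_f", (3, 4), 5), ("_g", (5,), 5), ("_a", (), 3), ("_b", (), 4),
    ("g_", (6,), 2), ("g_", (7,), 6), ("g_", (10,), 9), ("g_", (11,), 10),
    ("b_", (), 7), ("b_", (), 11), ("ag", (5,), 1), ("g_", (11,), 11),
}
Q_INF = {5}
DELTA_INF = DELTA | dashed_rules(DELTA, Q_INF)

# Encoding of the pair (f(a, g(g(b))), g(g(f(a, b)))).
pair = ("fg", ("ag", ("_f", ("_a",), ("_b",))), ("g_", ("g_", ("b_",))))
accepted = dash(0) in run(pair, DELTA_INF)
```

The five rules listed in Example 2 are among the generated ones, and `accepted` records whether the dashed final state is reached for the encoded pair.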


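The following preliminary lemma is proved by a straightforward induction
argument.


Lemma 2. If $t \to^*_\mathcal{A} p$ then $t \to^*_{\mathcal{A}_\infty} p$. If $C[p] \to^*_\mathcal{A} q$ then $C[p] \to^*_{\mathcal{A}_\infty} q$ and
$C[\bar{p}] \to^*_{\mathcal{A}_\infty} \bar{q}$. ✓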


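Theorem 2. Suppose $\circ$ is accepted by the $\mathrm{RR}_2$ automaton $\mathcal{A}$. If $t \in \mathrm{INF}_\circ$ then
$\langle t,u \rangle \in L(\mathcal{A}_\infty)$ for some term $u \in \mathcal{T}(\mathcal{F})$.


Proof. From $t \in \mathrm{INF}_\circ$ and $\circ = L(\mathcal{A})$ we obtain $\langle t,u \rangle \in L(\mathcal{A})$ for infinitely many
terms $u \in \mathcal{T}(\mathcal{F})$. Since the signature is finite, there are only finitely many ground
terms of any given height. Moreover, $\mathrm{height}(\langle t,u \rangle) = \max(\mathrm{height}(t), \mathrm{height}(u))$.
Hence there must exist a term $u \in \mathcal{T}(\mathcal{F})$ with $\langle t,u \rangle \in L(\mathcal{A})$ such that

$$\mathrm{height}(t) + |Q| + 1 < \mathrm{height}(u)$$

This is only possible if there are positions $p$ and $q$ such that $p \notin \mathrm{Pos}(t)$,
$pq \in \mathrm{Pos}(u)$, and $|Q| < |q|$. From $\mathrm{Pos}(\langle t,u \rangle) = \mathrm{Pos}(t) \cup \mathrm{Pos}(u)$ we obtain
$\langle t,u \rangle|_p = \langle \bot, u|_p \rangle$. Since $\langle t,u \rangle \in L(\mathcal{A})$ there exist states $r \in Q$ and
$q_f \in Q_f$ such that

$$\langle t,u \rangle = \langle t,u \rangle[\langle \bot, u|_p \rangle]_p \to^*_\mathcal{A} \langle t,u \rangle[r]_p \to^*_\mathcal{A} q_f$$

where we assume without loss of generality that the last step in the subsequence
$\langle \bot, u|_p \rangle \to^*_\mathcal{A} r$ uses a non-epsilon transition rule.


From $|Q| < |q|$ and $pq \in \mathrm{Pos}(u)$ we infer $|Q| < \mathrm{height}(\langle \bot, u|_p \rangle)$. Hence we
can use the pumping lemma (Lemma 1) to conclude the existence of infinitely
many terms $v \in \mathcal{T}(\mathcal{F})$ such that $\langle \bot, v \rangle \to^*_\mathcal{A} r$. Hence $r \in Q_\infty$ by Definition 2.
Since the last step of $\langle \bot, u|_p \rangle \to^*_\mathcal{A} r$ uses a non-epsilon transition rule, we obtain
$\langle \bot, u|_p \rangle \to^*_{\mathcal{A}_\infty} \bar{r}$ using Lemma 2 and a final application of a rule of shape (1). Also
using Lemma 2 we obtain $\langle t,u \rangle[\bar{r}]_p \to^*_{\mathcal{A}_\infty} \bar{q}_f$ as $\langle t,u \rangle[r]_p \to^*_\mathcal{A} q_f$. We conclude
$\langle t,u \rangle \in L(\mathcal{A}_\infty)$ as desired. ✓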


For the reverse direction of Theorem 3 we need two auxiliary results. The 
first result is proved by a straightforward induction argument. Here the mapping 
$\phi\colon \mathcal{T}(\mathcal{F}^{(2)} \cup Q \cup \bar{Q}) \to \mathcal{T}(\mathcal{F}^{(2)} \cup Q)$ erases all dashes from states.


Lemma 3. If $t \in \mathcal{T}(\mathcal{F}^{(2)} \cup Q \cup \bar{Q})$ and $t \to^*_{\mathcal{A}_\infty} p$ then $\phi(t) \to^*_\mathcal{A} \phi(p)$. ✓


With a little bit more effort, we obtain the second auxiliary result. The key 
step in the proof is identifying the rule of shape (1) that is used to create the 
first dashed state. 


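Lemma 4. If $t \in \mathcal{T}(\mathcal{F}^{(2)})$ and $t \to^*_{\mathcal{A}_\infty} \bar{p}$ then there exist a state $q \in Q_\infty$, a
context $C$, and a term $s$ such that $C[s] = t$, $\mathrm{root}(s) = \bot f$ for some $f \in \mathcal{F}$,
$s \to^*_\mathcal{A} q$, and $C[\bar{q}] \to^*_{\mathcal{A}_\infty} \bar{p}$. ✓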


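Theorem 3. Suppose $\circ$ is accepted by the $\mathrm{RR}_2$ automaton $\mathcal{A}$. If $\langle t,u \rangle \in L(\mathcal{A}_\infty)$
for some term $u \in \mathcal{T}(\mathcal{F})$ then $t \in \mathrm{INF}_\circ$.


Proof. From $\langle t,u \rangle \in L(\mathcal{A}_\infty)$ we obtain a final state $\bar{q}_f \in \bar{Q}_f$ with $\langle t,u \rangle \to^*_{\mathcal{A}_\infty} \bar{q}_f$.
Using Lemma 4, we obtain a context $C$, a term $s$ with $\mathrm{root}(s) = \bot f$ for some
$f \in \mathcal{F}$, and a state $q \in Q_\infty$ such that $C[s] = \langle t,u \rangle$, $s \to^*_\mathcal{A} q$, and $C[\bar{q}] \to^*_{\mathcal{A}_\infty} \bar{q}_f$.
Let $p$ be the position of the hole in $C$. From $C[s] = \langle t,u \rangle$ and $\mathrm{root}(s) = \bot f$,
we infer $p \in \mathrm{Pos}(u) \setminus \mathrm{Pos}(t)$. Since $q \in Q_\infty$ the set $\{v \in \mathcal{T}(\mathcal{F}) \mid \langle \bot, v \rangle \to^*_\mathcal{A} q\}$
is infinite. Hence the set $S = \{u[v]_p \in \mathcal{T}(\mathcal{F}) \mid \langle \bot, v \rangle \to^*_\mathcal{A} q\}$ is infinite, too. Let
$u[w]_p \in S$. So $\langle \bot, w \rangle \to^*_\mathcal{A} q$. Since $C$ is ground and $C[\bar{q}] \to^*_{\mathcal{A}_\infty} \bar{q}_f$, we obtain
$C[q] \to^*_\mathcal{A} q_f$ from Lemma 3. We have $C[\langle \bot, w \rangle] = \langle t, u[w]_p \rangle$ as $p \in \mathrm{Pos}(u) \setminus \mathrm{Pos}(t)$. It
follows that $\langle t, u[w]_p \rangle \in L(\mathcal{A})$ and thus there are infinitely many terms $u$ such
that $\langle t,u \rangle \in L(\mathcal{A})$. Since $\circ = L(\mathcal{A})$ we conclude the desired $t \in \mathrm{INF}_\circ$. ✓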


The final step to conclude that the infinity predicate is indeed regular is now
easy. 
Proof (of Theorem 1). Combining Theorem 2 and Theorem 3 yields the following
equivalence:


$$t \in \mathsf{INF}_\circ \iff \langle t, u\rangle \in L(\mathcal{A}_\infty) \text{ for some term } u$$


Hence a tree automaton that accepts $\mathsf{INF}_\circ$ is obtained by subjecting $\mathcal{A}_\infty$ to a
projection operation (on the first argument).


Projection on RR$_n$ automata has been formalized in Isabelle/HOL as part
of [10]. ✓
The mistake in the proof given in the appendix of [16] is quoted below and 
corresponds to the proof of Theorem 2: 


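The set $U = \{u \in \mathcal{T}(\mathcal{F}) \mid (t, u) \in \circ\}$ is infinite. Since the signature $\mathcal{F}$ is
finite, infinitely many terms $u$ in $U$ have a height greater than $t$. Hence
there exists a position $p \notin \mathrm{Pos}(t)$ such that the set $U' = \{u \in U \mid p \in \mathrm{Pos}(u)\}$
is infinite. For every $u \in U'$ we have $\langle t, u\rangle|_p = \langle \bot, u|_p\rangle$. Since
$\langle t, u\rangle$ is accepted by $\mathcal{A}$ and $Q$ is finite, there must exist a state $q'$ such
that $\langle \bot, u|_p\rangle \to_{\mathcal{A}}^{*} q'$ for infinitely many terms $u \in U'$. Therefore $q' \in Q_\infty$.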


The following example refutes the above reasoning, which is the key step in the
proof in [16]. It was found in an attempt to formalize the proof.


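Example 3. Let $t = f(a, b)$ and consider the infinite set
$U = \{f(f(a, b), g^n(b)) \mid n \geq 1\}$. The automaton


$$\mathcal{A} = (\{f, g, a, b\}^{(2)}, \{q_1, \dots, q_6\}, \{q_6\}, \Delta)$$


with $\Delta$ consisting of the transition rules


$$ff(q_4, q_5) \to q_6 \qquad \bot a \to q_2 \qquad bg(q_1) \to q_5 \qquad \bot b \to q_1$$
$$af(q_2, q_3) \to q_4 \qquad \bot b \to q_3 \qquad \bot g(q_1) \to q_1$$


accepts the relation $\circ = \{t\} \times U$. Consider the position $p = 11$. We have
$p \notin \mathrm{Pos}(t)$ and $p \in \mathrm{Pos}(u)$ for all terms $u \in U$. Hence $U' = U$. Moreover,
$\langle t, u\rangle|_p = \langle \bot, a\rangle = \bot a$ for all terms $u \in U'$. The only state reachable from $\bot a$ is $q_2$ and
clearly $q_2 \notin Q_\infty$.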
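
The counterexample can be replayed mechanically. The following Python sketch is an illustration only, not part of the formalization: it encodes pairs $\langle t, u\rangle$ as terms over pairs of symbols and runs the automaton of Example 3 bottom-up. The tuple representation of terms and the helper names `pair`, `states`, `accepts`, and `u` are assumptions made for this sketch; states $q_1, \dots, q_6$ are written as the integers 1 to 6.

```python
from itertools import product

BOT = "⊥"  # padding symbol

def pair(t, u):
    """Encode <t, u> as one term over pairs of symbols (None stands for a missing side)."""
    f, ts = (t[0], t[1:]) if t is not None else (BOT, ())
    g, us = (u[0], u[1:]) if u is not None else (BOT, ())
    n = max(len(ts), len(us))
    kids = tuple(pair(ts[i] if i < len(ts) else None,
                      us[i] if i < len(us) else None) for i in range(n))
    return ((f, g),) + kids

# Transition rules of Example 3: (symbol pair, argument states) -> target states.
RULES = {
    (("f", "f"), (4, 5)): {6},
    (("a", "f"), (2, 3)): {4},
    (("b", "g"), (1,)): {5},
    ((BOT, "a"), ()): {2},
    ((BOT, "b"), ()): {1, 3},
    ((BOT, "g"), (1,)): {1},
}
FINAL = {6}

def states(term):
    """Set of states reachable from a ground term (no epsilon rules here)."""
    sym, kids = term[0], term[1:]
    reached = set()
    for qs in product(*(states(k) for k in kids)):
        reached |= RULES.get((sym, qs), set())
    return reached

def accepts(t, u):
    return bool(states(pair(t, u)) & FINAL)

T = ("f", ("a",), ("b",))

def u(n):
    """The term f(f(a, b), g^n(b))."""
    arg = ("b",)
    for _ in range(n):
        arg = ("g", arg)
    return ("f", T, arg)
```

With these definitions, `accepts(T, u(n))` holds for every `n >= 1`, while the subterm at position 11 of `pair(T, u(n))` only ever reaches state 2, which is reached by the single right-only term $\bot a$.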


4 Executable Infinity Predicate 


Owing to the definition of $Q_\infty$, the automaton $\mathcal{A}_\infty$ defined in Definition 3 is
not executable. In this section we give an equivalent but executable definition of
$Q_\infty$, which we name $Q^e_\infty$:


$$Q^e_\infty = \{q \mid p \leadsto p \text{ and } p \leadsto q \text{ for some state } p \in Q\}$$ ✓


Here the relation $\leadsto$ is defined using the inference rules in Figure 1. Before
proving that the two definitions are equivalent, we illustrate the definition of
$Q^e_\infty$ by revisiting Example 1.
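$$\frac{\bot f(p_1, \dots, p_n) \to_{\mathcal{A}} p}{p_i \leadsto p} \qquad \frac{p \leadsto q \quad q \to_{\mathcal{A}} r}{p \leadsto r} \qquad \frac{p \leadsto q \quad q \leadsto r}{p \leadsto r}$$


Fig. 1. Inference rules for computing $Q^e_\infty$.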


Example 4. We obtain $3 \leadsto 5$ and $4 \leadsto 5$ by applying the first inference rule to
the transition rule $\bot f(3, 4) \to 5$. Similarly, $\bot g(5) \to 5$ gives rise to $5 \leadsto 5$. Since
$\mathcal{A}$ has no epsilon transitions, no further inferences can be made. It follows that


$Q^e_\infty = \{5\}$.
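
The inference rules lend themselves to a simple saturation loop. The Python sketch below is a minimal illustration under assumed input conventions (it is not the Isabelle code): `bot_rules` lists the rules $\bot f(p_1, \dots, p_n) \to q$ as pairs of an argument tuple and a target state, and `eps` lists epsilon transitions as pairs of states. The function name `executable_q_inf` is hypothetical.

```python
def executable_q_inf(bot_rules, eps):
    """Compute {q | p ~> p and p ~> q for some p} by saturating the three inference rules."""
    # First rule: every argument state of a bottom-rule leads to its target.
    leads = {(p, q) for args, q in bot_rules for p in args}
    changed = True
    while changed:
        changed = False
        # Second rule: extend p ~> q by an epsilon transition q -> r.
        new = {(p, r) for (p, q) in leads for (q2, r) in eps if q == q2}
        # Third rule: transitivity.
        new |= {(p, r) for (p, q) in leads for (q2, r) in leads if q == q2}
        if not new <= leads:
            leads |= new
            changed = True
    loops = {p for (p, q) in leads if p == q}
    return {q for (p, q) in leads if p in loops}

# The two rules mentioned in Example 4, with no epsilon transitions.
example4 = executable_q_inf([((3, 4), 5), ((5,), 5)], set())
```

For the two rules quoted in Example 4 the loop stabilizes immediately with the pairs (3, 5), (4, 5), and (5, 5), so `example4` is `{5}`.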


We call a term in $\mathcal{T}(\{\bot\} \times \mathcal{F})$ right-only. A term in $\mathcal{T}((\{\bot\} \times \mathcal{F}) \cup \{\square\})$
with exactly one occurrence of the hole $\square$ is a right-only context.


Definition 4. We denote the composition of ^4 . and > by > A. 


The proof of the next lemma is straightforward. Note that the relations $\Rightarrow_{\mathcal{A}}^{*}$
and $\to_{\mathcal{A}}^{*}$ do not coincide on mixed terms, involving function symbols and states.


Lemma 5. Let $C$ be a ground context. We have $C[p] \to_{\mathcal{A}}^{*} q$ if and only if $p \to_{\mathcal{A}}^{*} p'$
and $C[p'] \Rightarrow_{\mathcal{A}}^{*} q$ for some state $p'$. ✓


Lemma 6. $Q_\infty \subseteq Q^e_\infty$.
Proof. We start by proving the following claim:


if $C[p] \Rightarrow_{\mathcal{A}}^{*} q$ and $C$ is a non-empty right-only context then $p \leadsto q$ $(*)$


We use induction on the structure of $C$. If $C = \square$ there is nothing to show.
Suppose $C = \bot f(t_1, \dots, C', \dots, t_n)$ where $C'$ is the $i$-th subterm of $C$. The
sequence $C[p] \Rightarrow_{\mathcal{A}}^{*} q$ can be rearranged as $C[p] = \bot f(t_1, \dots, C'[p], \dots, t_n) \Rightarrow_{\mathcal{A}}^{*}
\bot f(q_1, \dots, q_n) \to_{\mathcal{A}} q' \to_{\mathcal{A}}^{*} q$. We obtain $q_i \leadsto q'$ and subsequently $q_i \leadsto q$ by
using the inference rules in Figure 1. If $C' = \square$ then $p = q_i$ and if $C' \neq \square$
then the induction hypothesis yields $p \leadsto q_i$ and thus $p \leadsto q$ by transitivity. This
concludes the proof of $(*)$. ✓


Assume q € Qo, so there exist infinitely many terms t such that (L, t) >% q. 
Since the signature is finite, there exist terms of arbitrary height. Thus there 
exists an arbitrary but fixed term t such that the height of t is greater than the 
number of states of Q. Write t = f(ti,...,tn). Since the height of t is greater 
than the number of the states in Q, there exist a subterm s of t, a state p, and 
contexts C1 and C» 4 O such that 
From Lemma 5 we obtain a state q’ such that p >% q’ and C5[q'] >% p. Hence 
q' ~ p by (x). We obtain q' ^» q' from q' ~ p in connection with the inference 
rule for epsilon transitions. We perform a case analysis of the context Ci. 


— If C1 = O then p >% q and thus q' ~ q follows from q' ~> p in connection 
with the inference rule for epsilon transitions. Hence q € QS- 

— If C1 Z O then Lemma 5 yields a state q” such that p =>% q” and C; [g"] ^A 
q. Hence q” ~~ q by (*). We also have C5[q'] —4 q” and thus q' ~ q” by 
(*). We obtain q' ^» q from the transitivity rule. Hence also in this case we 
obtain q € QS. v 


For the following lemma, we need the fact that A can be assumed to be trim, 
so every state is productive and reachable. We may do so because Theorem 1 
talks about regular relations, and any automaton that accepts the same language 
as A will witness the fact that the given relation o is regular. 


Lemma 7. $Q^e_\infty \subseteq Q_\infty$, provided that $\mathcal{A}$ is trim.


Proof. In connection with the fact that A accepts o C 7 (.F) x T(F), trimness 
of A entails that any run t =>% q is embedded into an accepting run C[t] 274 
Clq) >^ af € Qr. So C[t| = (u,v) for some (u,v) € o, and hence t must be a 
well-formed term. Moreover, if root(t) = Lf for some f € F then t = (L, u) for 
some term u € 7 (.F). We now show the converse of claim (x) in the proof of 
Lemma 6 for the relation —7: 


if p ^ q then C[p] >^ q for some ground right-only context C # (xx) 


We prove the claim by induction on the derivation of p ~ q. First suppose 
p ~ q is derived from the transition rule Lf(pi,...,pi,...,~Pn) > qin A 
with p; — p. Because all states are reachable by well-formed terms, there ex- 
ist terms t),...,tn € T(F) such that (1,7) —' pi for all 1 € i € n. Let 
C, = Lf((l,th),...,0,...,(L,tn)) where the hole is the i-th argument. We 
have Cilo] >% Lf(pi.....Di,---. pa) >a q- Next suppose p ~> q is derived 
from p ~ q' and q' >a q. The induction hypothesis yields a ground right-only 
context C # O such that C[p] =>% q'. Hence also C[p] =>% q. Finally, sup- 
pose p ^» q is derived from p ^» r and r ^» q. The induction hypothesis yields 
non-empty ground right-only contexts Cı and C5 such that C;[p] >% r and 
Co[r] =>% q. Hence C[p] =>% q for the context C = C2[C1]. This concludes the 
proof of (x). v 


Now let q € Q&. So there exists a state p such that p ^ p and p ~ q. 
Using (**), we obtain non-empty ground right-only contexts C, and C2 such 
that Cilp] >% p and C5[p] —4 q. Since all states are reachable, there exists 
a ground term t € T (2) such that t —*, p. Hence Co[(t] 2, q and, by the 
observation made at the beginning of the proof, C2[¢t] is a well-formed term. 
Since C% is right-only, it follows that t = (L, u) for some term u € 7 (.F). Now 
consider the infinitely many terms t,, = C»[CT' [t] for n > 0. We have tn 24 q 
and tn is right-only by construction. Hence q € Qu. v 
Corollary 1. $Q^e_\infty = Q_\infty$, provided that $\mathcal{A}$ is trim.


5 Normal Form Predicate 


The normal form predicate NF can be defined in the first-order theory of rewrit- 
ing as 


$$\mathsf{NF}(t) \iff \neg\exists u\, (t \to u)$$
and this gives rise to the following procedure: 


1. An RR$_2$ automaton is constructed that accepts the encoding of the rewrite
relation $\to$.


2. The RR$_2$ automaton of step 1 is projected into a tree automaton that accepts
the set of reducible ground terms.


3. Complementation is applied to the automaton of step 2 to obtain a tree 
automaton that accepts the set of ground normal forms. 


Since projection may transform a deterministic tree automaton into a non- 
deterministic one, this is inefficient. In this section we provide a direct con- 
struction of a tree automaton that accepts the set of ground normal forms of a 
left-linear TRS, which goes back to Comon [5], and present a formalized correct- 
ness proof. Throughout this section R is assumed to be left-linear. 


We start with defining some preliminary concepts. 


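Definition 5. Given a signature $\mathcal{F}$, we write $\mathcal{F}_\bot$ for the extension of $\mathcal{F}$ with a
fresh constant symbol $\bot$. Given $t \in \mathcal{T}(\mathcal{F}, \mathcal{V})$, $t^\bot$ denotes the result of replacing
all variables in $t$ by $\bot$:


$$x^\bot = \bot \qquad f(t_1, \dots, t_n)^\bot = f(t_1^\bot, \dots, t_n^\bot)$$ ✓


We define the partial order $\leq$ on $\mathcal{T}(\mathcal{F}_\bot)$ as the least congruence that satisfies
$\bot \leq t$ for all terms $t \in \mathcal{T}(\mathcal{F}_\bot)$:


$$\frac{}{\bot \leq t} \qquad \frac{t_1 \leq u_1 \quad \cdots \quad t_n \leq u_n}{f(t_1, \dots, t_n) \leq f(u_1, \dots, u_n)}$$ ✓


The partial map $\uparrow\colon \mathcal{T}(\mathcal{F}_\bot) \times \mathcal{T}(\mathcal{F}_\bot) \to \mathcal{T}(\mathcal{F}_\bot)$ is defined as follows:
$$\bot \uparrow t = t \uparrow \bot = t \qquad f(t_1, \dots, t_n) \uparrow f(u_1, \dots, u_n) = f(t_1 \uparrow u_1, \dots, t_n \uparrow u_n)$$ ✓


It is not difficult to show that $t \uparrow u$ is the least upper bound of comparable
terms $t$ and $u$.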
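
To make the order and the merge operation concrete, here is a small Python sketch. It is an illustration under assumed conventions, not the formalized definition: terms are nested tuples `(symbol, child, ...)`, the constant $\bot$ is the tuple `BOT`, and an undefined merge is represented by `None`. The names `leq` and `merge` are hypothetical.

```python
BOT = ("⊥",)

def leq(s, t):
    """s <= t: t is obtained from s by replacing occurrences of ⊥ with terms."""
    if s == BOT:
        return True
    return (s[0] == t[0] and len(s) == len(t)
            and all(leq(a, b) for a, b in zip(s[1:], t[1:])))

def merge(s, t):
    """The partial map ↑; returns None where ↑ is undefined."""
    if s == BOT:
        return t
    if t == BOT:
        return s
    if s[0] != t[0] or len(s) != len(t):
        return None
    kids = [merge(a, b) for a, b in zip(s[1:], t[1:])]
    if any(k is None for k in kids):
        return None
    return (s[0], *kids)

# Illustrative terms f(g(a), ⊥, ⊥) and f(⊥, h(⊥), ⊥).
s1 = ("f", ("g", ("a",)), BOT, BOT)
s2 = ("f", BOT, ("h", BOT), BOT)
upper = merge(s1, s2)  # f(g(a), h(⊥), ⊥)
```

Here `upper` is an upper bound of both arguments with respect to `leq`, whereas `merge(("h", BOT), ("g", ("a",)))` is `None` because the root symbols differ.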


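Definition 6. ✓ Let $\mathcal{R}$ be a TRS over a signature $\mathcal{F}$. We write $T_\bot$ for the set
$\{t^\bot \mid t \lhd \ell \text{ for some } \ell \to r \in \mathcal{R}\} \cup \{\bot\}$. The set $T_\uparrow$ is obtained by closing $T_\bot$
under $\uparrow$.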
Example 5. Consider the TRS $\mathcal{R}$ consisting of the following rules:
$$h(f(g(a), x, y)) \to g(a) \qquad g(f(x, h(z), y)) \to y \qquad h(f(x, y, h(a))) \to h(x)$$
We start by collecting the subterms of the left-hand sides:


$$T_\bot = \{\bot, a, g(a), h(\bot), h(a), f(g(a), \bot, \bot), f(\bot, h(\bot), \bot), f(\bot, \bot, h(a))\}$$


Closing $T_\bot$ under $\uparrow$ adds the following terms:


$$f(g(a), \bot, \bot) \uparrow f(\bot, h(\bot), \bot) = f(g(a), h(\bot), \bot)$$
$$f(\bot, \bot, h(a)) \uparrow f(\bot, h(\bot), \bot) = f(\bot, h(\bot), h(a))$$
$$f(g(a), h(\bot), \bot) \uparrow f(\bot, h(\bot), h(a)) = f(g(a), h(\bot), h(a))$$


Lemma 8. The set $T_\uparrow$ is finite.


Proof. If $t \uparrow u$ is defined then $\mathrm{Pos}(t \uparrow u) = \mathrm{Pos}(t) \cup \mathrm{Pos}(u)$. It follows that the
positions of terms in $T_\uparrow \setminus T_\bot$ are positions of terms in $T_\bot$. Since $T_\bot$ is finite,
there are only finitely many such positions. Hence the finiteness of $T_\uparrow$ follows
from the finiteness of $\mathcal{F}$.


Although the above proof is simple enough, we formalized the proof below
which is based on a concrete algorithm to compute $T_\uparrow$. Actually, the algorithm
presented below is based on a general saturation procedure, which is of independent
interest.


Definition 7. Let $f\colon U \times U \to U$ be a (possibly partial) function and let $S$ be a
finite subset of $U$. The closure $C_f(S)$ is the least extension of $S$ with the property
that $f(a, b) \in C_f(S)$ whenever $a, b \in C_f(S)$ and $f(a, b)$ is defined.


The following lemma provides a sufficient condition for closures to exist. The
proof gives a concrete algorithm to compute the closure. 


Lemma 9. If $f$ is a total, associative, commutative, and idempotent function
then $C_f(S)$ exists and is finite.


Proof. A straightforward induction proof reveals that for every $a \in C_f(S)$ there
exist elements $a_1, \dots, a_n \in S$ such that $a = f(a_1, f(a_2, \dots f(a_{n-1}, a_n) \dots))$.
Select an arbitrary element $b \in S$. If $b$ is among $a_1, \dots, a_n$ then, using the
properties of $f$, we obtain $a \in \{b\} \cup \{f(b, c) \mid c \in C_f(S \setminus \{b\})\}$. If $b$ is not among
$a_1, \dots, a_n$ then $a \in C_f(S \setminus \{b\})$. Hence


$$C_f(S) = C_f(S \setminus \{b\}) \cup \{b\} \cup \{f(b, c) \mid c \in C_f(S \setminus \{b\})\}$$


for every $b \in S$. Since $S$ is finite, this gives rise to an iterative algorithm to
compute $C_f(S)$, which is given in Listing 1. In each iteration only finitely many
elements are added. Hence $C_f(S)$ is finite. ✓




saturate(S):
    I ← ∅
    for all x ∈ S do
        I ← {x} ∪ I ∪ {f(x, y) | y ∈ I}
    return I

Listing 1. Iterative closure algorithm. 
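
The listing translates almost line by line into executable code. The following Python sketch is an illustration, not the Isabelle implementation; passing `f` as an explicit argument and the two sample functions (set union and `math.gcd`, both total, associative, commutative, and idempotent on their domains) are choices made for this example.

```python
from math import gcd

def saturate(S, f):
    """Closure of the finite set S under a total, associative, commutative, idempotent f."""
    closure = set()
    for x in S:
        # Add x itself and combine it with everything collected so far.
        closure = {x} | closure | {f(x, y) for y in closure}
    return closure

# Closing three singletons under union yields all non-empty subsets of {1, 2, 3}.
unions = saturate({frozenset({1}), frozenset({2}), frozenset({3})},
                  lambda a, b: a | b)

# Closing {4, 6, 9} under gcd adds 1, 2, and 3.
gcds = saturate({4, 6, 9}, gcd)
```

Each element of `S` is visited once, and the comprehension is evaluated against the closure computed so far, which mirrors the recursive equation in the proof of Lemma 9.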


Since our function $\uparrow$ is partial, we need to lift it to a total function that
preserves associativity and commutativity. In our abstract setting this entails
finding a binary predicate $P$ on $U$ such that $f(a, b)$ is defined if $P(a, b)$ holds.
In addition, the following properties need to be fulfilled:

— P is reflexive and symmetric, 
— if P(a, f(b,c)) and P(b,c) hold then P(a,b) and P(f(a,b),c) hold as well,

for all a,b,c € U. 

For the details we refer to the formalization. ✓


Definition 8. ✓ The tree automaton $\mathcal{A}_{\mathsf{NF}(\mathcal{R})} = (\mathcal{F}, Q, Q_f, \Delta)$ is defined as
follows: $Q = Q_f = T_\uparrow$ and $\Delta$ consists of all transition rules


$$f(p_1, \dots, p_n) \to q$$


such that $f(p_1, \dots, p_n)$ is no redex and $q$ is the maximal element of $Q$ satisfying
$q \leq f(p_1, \dots, p_n)$.$^4$
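
The transition rules need not be enumerated up front; they can be evaluated on demand while processing a ground term bottom-up. The Python sketch below illustrates this reading of Definition 8 under stated assumptions: terms are nested tuples, `states` is a list standing for $T_\uparrow$ (assumed to be closed under $\uparrow$), `lhs_bots` holds the terms $\ell^\bot$ for the left-hand sides $\ell$, and the tiny TRS $\{g(a) \to a\}$ over symbols $a$, $b$, $g$ is a hypothetical example, not one from the paper.

```python
BOT = ("⊥",)

def leq(s, t):
    """s <= t in the order generated by ⊥ <= t."""
    if s == BOT:
        return True
    return (s[0] == t[0] and len(s) == len(t)
            and all(leq(a, b) for a, b in zip(s[1:], t[1:])))

def run(term, states, lhs_bots):
    """State reached by a ground term, or None if no transition applies."""
    kids = [run(k, states, lhs_bots) for k in term[1:]]
    if any(k is None for k in kids):
        return None
    top = (term[0], *kids)                 # f(p_1, ..., p_n) over states
    if any(leq(l, top) for l in lhs_bots):
        return None                        # a redex has no transition
    below = [q for q in states if leq(q, top)]
    for q in below:                        # pick the maximum of the states below top
        if all(leq(r, q) for r in below):
            return q
    return None

# Hypothetical TRS {g(a) -> a}: the only proper subterm of the left-hand side is a.
STATES = [BOT, ("a",)]
LHS = [("g", ("a",))]
```

In this toy instance `run(("g", ("g", ("b",))), STATES, LHS)` reaches the state `BOT`, so the normal form $g(g(b))$ is accepted, while `run(("g", ("a",)), STATES, LHS)` returns `None` because $g(a)$ is a redex.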


Example 6. For the TRS $\mathcal{R}$ of Example 5, the tree automaton $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ consists
of the following transition rules:


2 ipei A dtyed 

Pes «p if p d {1,6,9, 10} ^L if p d {1,8,10} 
5 ifp-2,q4 {3,4} 

ifpZ2,q€ (3,4), rZ4 

ifq ¢ (3,4, 7—4 

ifp-22,q€ {3,4}, rz4 

ifpZ2,q€ (3,4), r—4 

10 ifp=2,qe€ {3,4},r=4 

0 otherwise 


f(p,q,r) ^ 


O o N OO 


Here we use the following abbreviations:


0 = ⊥       3 = h(⊥)           6 = f(⊥, h(⊥), ⊥)       8 = f(g(a), h(⊥), ⊥)
1 = a       4 = h(a)           7 = f(⊥, ⊥, h(a))       9 = f(⊥, h(⊥), h(a))
2 = g(a)    5 = f(g(a), ⊥, ⊥)                          10 = f(g(a), h(⊥), h(a))


4 Since states are terms from $T_\uparrow$ here, Definition 5 applies.
As can be seen from the above example, the tree automaton $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ is not
completely defined. Unlike the construction in [5], we do not have an additional
state that is reached by all reducible ground terms.


Before proving that $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ accepts the ground normal forms of $\mathcal{R}$, we first
show that $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ is well-defined, which amounts to showing that for every
$f(p_1, \dots, p_n)$ with $f \in \mathcal{F}$ and $p_1, \dots, p_n \in T_\uparrow$ the set of states $q$ such that
$q \leq f(p_1, \dots, p_n)$ has a maximum element with respect to the partial order $\leq$.


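Lemma 10. For every term $t \in \mathcal{T}(\mathcal{F}_\bot)$ the set $\{s \in T_\uparrow \mid s \leq t\}$ has a unique
maximal element.


Proof. Let $S = \{s \in T_\uparrow \mid s \leq t\}$. Because $\bot \leq t$ and $\bot \in T_\uparrow$, $S \neq \emptyset$. If $s_1, s_2 \in S$
then $s_1 \leq t$ and $s_2 \leq t$ and thus $s_1 \uparrow s_2$ is defined and satisfies $s_1 \uparrow s_2 \leq t$. Since
$T_\uparrow$ is closed under $\uparrow$, $s_1 \uparrow s_2 \in T_\uparrow$ and thus $s_1 \uparrow s_2 \in S$. Consequently, $S$ has a
unique maximal element.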


The next lemma is a trivial consequence of the fact that $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ has no
epsilon transitions.


Lemma 11. The tree automaton $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$ is deterministic. ✓


Lemma 12. If $t \in \mathcal{T}(\mathcal{F})$ with $t \to_{\mathcal{A}}^{*} q$ and $s^\bot \leq t$ for a proper subterm $s$ of
some left-hand side of $\mathcal{R}$ then $s^\bot \leq q$.


Proof. We use structural induction on $t$. Let $t = f(t_1, \dots, t_n)$. We have $t \to_{\mathcal{A}}^{*}
f(q_1, \dots, q_n) \to_{\mathcal{A}} q$. We proceed by case analysis on $s$. If $s$ is a variable then
$s^\bot = \bot$ and, as $\bot$ is minimal in $\leq$, we obtain $s^\bot \leq q$. Otherwise we must have
$\mathrm{root}(s) = f$ from the assumption $s^\bot \leq t$. So we may write $s = f(s_1, \dots, s_n)$.
The induction hypothesis yields $s_i^\bot \leq q_i$ for all $1 \leq i \leq n$. Hence $s^\bot =
f(s_1^\bot, \dots, s_n^\bot) \leq f(q_1, \dots, q_n)$. Additionally we have $s^\bot \in Q$ by Definition 8
as $s$ is a proper subterm of a left-hand side of $\mathcal{R}$. Since $f(q_1, \dots, q_n) \to q$ is a
transition rule, we obtain $f(s_1, \dots, s_n)^\bot \leq q$ from the maximality of $q$. ✓


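Using the previous result we can prove that no redex of $\mathcal{R}$ reaches a state in
$\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$.


Lemma 13. If $t \in \mathcal{T}(\mathcal{F})$ is a redex then $t \to_{\mathcal{A}}^{*} q$ for no state $q \in T_\uparrow$.


Proof. We have $\ell^\bot \leq t$ for some left-hand side $\ell$ of $\mathcal{R}$. For a proof by contradiction,
assume $t \to_{\mathcal{A}}^{*} q$. Write $t = f(t_1, \dots, t_n)$. We have $t \to_{\mathcal{A}}^{*} f(q_1, \dots, q_n) \to_{\mathcal{A}} q$
and obtain $\ell^\bot \leq f(q_1, \dots, q_n)$ by a case analysis on $\ell$ and Lemma 12. Therefore
the transition rule $f(q_1, \dots, q_n) \to q$ cannot exist by Definition 8. ✓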


Lemma 14. If $t \to_{\mathcal{A}}^{*} q$ and $t \in \mathcal{T}(\mathcal{F})$ then $q \leq t$.


Proof. We use structural induction on $t$. Let $t = f(t_1, \dots, t_n)$. We have $t \to_{\mathcal{A}}^{*}
f(q_1, \dots, q_n) \to_{\mathcal{A}} q$. The induction hypothesis yields $q_i \leq t_i$ for all $1 \leq i \leq n$ and
thus also $f(q_1, \dots, q_n) \leq f(t_1, \dots, t_n)$. We have $q \leq f(q_1, \dots, q_n)$ by Definition 8
and thus $q \leq t$ by the transitivity of $\leq$. ✓
Lemma 15. If $t \in \mathsf{NF}(\mathcal{R})$ then $t \to_{\mathcal{A}}^{*} q$ for some state $q \in T_\uparrow$.


Proof. We use structural induction on $t$. Let $t = f(t_1, \dots, t_n)$. Since $t_1, \dots, t_n \in
\mathsf{NF}(\mathcal{R})$ we obtain $f(t_1, \dots, t_n) \to_{\mathcal{A}}^{*} f(q_1, \dots, q_n)$ from the induction hypothesis.
Suppose $f(q_1, \dots, q_n)$ is a redex, so $\ell^\bot \leq f(q_1, \dots, q_n)$ for some left-hand side $\ell$ of
$\mathcal{R}$. From Lemma 14 we obtain $q_i \leq t_i$ for all $1 \leq i \leq n$ and thus $f(q_1, \dots, q_n) \leq
f(t_1, \dots, t_n)$. Hence $\ell^\bot \leq f(t_1, \dots, t_n)$. This however contradicts the assumption
that $t$ is a normal form. (Here we need left-linearity of $\mathcal{R}$.) Therefore $f(q_1, \dots, q_n)$
is no redex and thus, using Lemma 10, there exists a transition $f(q_1, \dots, q_n) \to q$
in $\Delta$ and thus $t \to_{\mathcal{A}}^{*} q$. ✓


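Theorem 4. If $\mathcal{R}$ is a left-linear TRS then $L(\mathcal{A}_{\mathsf{NF}(\mathcal{R})}) = \mathsf{NF}(\mathcal{R})$.


Proof. Let $t \in \mathcal{T}(\mathcal{F})$. If $t \in \mathsf{NF}(\mathcal{R})$ then $t \to_{\mathcal{A}}^{*} q$ for some state $q \in T_\uparrow$ by
Lemma 15. Since all states in $T_\uparrow$ are final, $t \in L(\mathcal{A}_{\mathsf{NF}(\mathcal{R})})$.


Next assume $t \notin \mathsf{NF}(\mathcal{R})$. Hence $t = C[s]$ for some redex $s$. According to
Lemma 13, $s$ does not reach a state in $\mathcal{A}_{\mathsf{NF}(\mathcal{R})}$. Hence also $t$ cannot reach a state
and thus $t \notin L(\mathcal{A}_{\mathsf{NF}(\mathcal{R})})$. ✓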


6 Conclusion and Future Work 


In this paper we presented formalized correctness proofs of the regularity of 
the infinity and normal form predicates in the first-order theory of rewriting. 
For the former we also provided an executable version, which is important for 
checking certificates that will be provided in a future version of FORT. Our 
results are an important step towards the ultimate goal of proving the correctness 
of the decisions reported by FORT, but much work remains to be done. We are 
developing a certification language which reflects the high-level proof steps in the 
decision procedure for the full first-order theory of rewriting. This language will 
be independent of FORT. In particular, details of the intermediate tree automata 
computed by FORT will not be part of certificates. This keeps the certificates 
small and avoids having to implement a verified (and expensive) equivalence 
check on tree automata. We will provide executable Isabelle code for each of 
the constructs in the certification language, and so this involves replaying the 
automata constructions in Isabelle. 


We conclude the paper by providing some details of the size of our formal- 
ization in Table 1. 
Abstract. Fixpoint logics have recently been drawing attention as common
foundations for automated program verification. We formalize fold/unfold
transformations for fixpoint logic formulas and show how they can be 
used to enhance a recent fixpoint-logic approach to automated program 
verification, including automated verification of relational and tempo- 

ral properties. We have implemented the transformations in a tool and 
confirmed its effectiveness through experiments. 


1 Introduction 


A wide range of program properties can be verified by reducing to satisfiabil- 
ity/validity in a fixpoint logic [3-6, 18, 20, 22, 23, 29, 35]. In this paper, we build 
on top of MuArith, a first-order logic with least/greatest fixpoint operators and 
integer arithmetic, recently proposed by Kobayashi et al. [22]. It offers a powerful 
tool to handle the full class of modal μ-calculus properties of while-programs
(imperative programs with loops but without general recursion). In contrast, earlier
studies on temporal program verification require different methods for each
subclass of the modal μ-calculus properties, such as LTL [12,16,28], CTL [2,3,13,34],
and CTL* [11]. The recent program verifier based on MuArith [22] is effective 
in practice, i.e., by exploiting general-purpose solvers for Satisfiability Modulo 
Theories (SMT) and Constrained Horn Clauses (CHC), it can outperform tools 
designed specifically for CTL verification of C programs [13]. 

Despite these promising results, the generality of the fixpoint logic approach
comes at a cost. Since fixpoint logic formulas obtained by reduction from various
verification problems often involve nested fixpoint operators, it could be chal- 
lenging to check the validity of these formulas automatically. To enhance the 
capability of fixpoint logic provers, in this paper, we propose novel fold/unfold 
transformations and prove their correctness. These transformations are generally 
used to simplify relational verification, and in particular, to reduce the num- 
ber of recurrences used in the program (or a set of programs) under analysis. 
Originally proposed for logic programming [8, 19,32], they have been recently 
adopted for determining the satisfiability of CHC [15,26] and allow discovery of 
relational invariants for a pair of loopy (or recursive) programs, as opposed to 
invariants within each individual program. Our transformations can be regarded 
as extensions of such transformations for a fixpoint logic, where quantifiers and 
arbitrarily nested least/greatest fixpoint operators are allowed. 


© The Author(s) 2020 
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 195-214, 2020. 
We also present a procedure that seeks a way to apply the proposed fold/unfold
transformations efficiently. Besides non-determinism in the choice of which
fixpoint formulas to unfold, our "fold" operation replaces a formula $\varphi$ with $P$ (where
$P$ is the predicate defined by $P \triangleq \varphi$) and requires various reasoning to convert
the current goal formula to a form $E[\varphi]$, where the form of $E$ can be more complex
than in the case of fold/unfold transformations for logic programming or
CHC.

We have implemented the transformations and integrated them with the 
program verifier Mu2CHC [22] based on MuArith. We considered a number of 
examples of MuArith formulas which include formulas obtained from program 
verification problems for checking relational and temporal properties. Our new 
transformations allowed Mu2CHC to solve these formulas, which would not be 
doable otherwise. 

To sum up, our contributions are: (i) a formalization of fold/unfold transfor- 
mations for a fixpoint logic and proofs of their soundness, (ii) demonstration of 
the usefulness of the proposed transformations for verification of relational and 
temporal properties of programs, and (iii) a concrete procedure for automated 
transformation and its implementation and experiments. 

The rest of this paper is structured as follows. Section 2 reviews the defini- 
tion of the first-order fixpoint logic MuArith [22], and reductions from program 
verification problems to validity checking in MuArith. Section 3 formalizes our 
transformations and proves their correctness. Section 4 shows applications of 
our transformations to verification of relational and temporal properties of re- 
cursive programs. Section 5 reports an implementation and experimental results. 
Section 6 discusses related work and Section 7 concludes the paper. 


2 First-Order Fixpoint Logic MuArith 


We review the first-order fixpoint logic MuArith [22] in this section. MuArith is a 
variation of Mu-Arithmetic studied by Lubarsky [25] and Bradfield [7], obtained 
by replacing natural numbers with integers. 


2.1 Syntax 


The set of (propositional) formulas, ranged over by $\varphi$, is defined in the following
grammar.


$\varphi$ (formulas) ::= $a_1 \geq a_2 \mid P^{(k)}(a_1, \dots, a_k) \mid \varphi_1 \vee \varphi_2 \mid \varphi_1 \wedge \varphi_2 \mid \forall x.\varphi \mid \exists x.\varphi$
$P^{(k)}$ ($k$-ary predicates) ::= $X^{(k)} \mid \lambda(x_1, \dots, x_k).\varphi \mid \mu X^{(k)}(x_1, \dots, x_k).\varphi \mid \nu X^{(k)}(x_1, \dots, x_k).\varphi$
$a$ (arithmetic expressions) ::= $n \mid x \mid a_1 + a_2 \mid a_1 - a_2$


The metavariable $\varphi$ represents a proposition, and $P$ denotes a predicate on
(a tuple of) integers. We write $\top$ for $0 \geq 0$ and $\bot$ for $0 \geq 1$. In examples,
we may also use other relational symbols such as $>$ and $=$. The meta-variable
$x$ denotes an integer variable, and the meta-variable $X^{(k)}$ denotes a $k$-ary first-order
predicate variable. We write $\mathit{ar}(X^{(k)})$ for the arity of the predicate variable
$X$, i.e., $k$; we often omit the superscript $(k)$ and just write $X$ for a predicate
variable. The predicate $\mu X^{(k)}(x_1, \dots, x_k).\varphi$ (resp. $\nu X^{(k)}(x_1, \dots, x_k).\varphi$) denotes
the least (resp. greatest) predicate $X$ such that $X(x_1, \dots, x_k)$ equals $\varphi$.


Example 1. The predicate $\mu X(x).(x = 0 \vee X(x - 1))$ denotes the least predicate $X$ such
that $X(x) \equiv x = 0 \vee X(x - 1) \equiv x = 0 \vee x - 1 = 0 \vee X(x - 2) \equiv \cdots$, i.e.,
$\lambda(x).x \geq 0$. In contrast, $\nu X(x).(x = 0 \vee X(x - 1))$ denotes $\lambda(x).\top$.


We write $\mathbf{FV}(\varphi)$ for the set of free (predicate and integer) variables in
$\varphi$; $\forall x$, $\exists x$, $\mu X^{(k)}$, $\nu X^{(k)}$, and $\lambda x$ are binders. We sometimes write $\tilde{x}$ for a
sequence of variables $x_1, \dots, x_k$. We often write $\overline{\varphi}$ and $\overline{X}$ for the De Morgan
dual of a formula $\varphi$ and a predicate variable $X$, respectively. For example,
$\overline{\mu X(x).x = 0 \vee X(x - 1)} = \nu \overline{X}(x).x \neq 0 \wedge \overline{X}(x - 1)$. Here, $\overline{X}$ is a predicate
variable, so the right-hand side is α-equivalent to $\nu X(x).x \neq 0 \wedge X(x - 1)$. The
overline for $\overline{X}$ is used to indicate that it corresponds to the dual of $X$ in the
original formula $\mu X(x).x = 0 \vee X(x - 1)$.


2.2 Semantics 


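In this subsection, we define the formal semantics of formulas. Let $\mathbf{Z}$ be the set
of integers, and $\mathbf{B} = \{\top_{\mathbf{B}}, \bot_{\mathbf{B}}\}$, with $\bot_{\mathbf{B}} \sqsubseteq_{\mathbf{B}} \top_{\mathbf{B}}$. Let $D_k$ be the set $\mathbf{Z}^k \to \mathbf{B}$ of
functions (where $\mathbf{Z}^k$ denotes the set of tuples consisting of $k$ integers). We define
the partial order $\sqsubseteq_k$ on $D_k$ by:


$$f \sqsubseteq_k g \iff \forall n_1, \dots, n_k \in \mathbf{Z}.\, f(n_1, \dots, n_k) \sqsubseteq_{\mathbf{B}} g(n_1, \dots, n_k).$$


Note that $(D_k, \sqsubseteq_k)$ is a complete lattice, with $\lambda x_1.\cdots\lambda x_k.\bot_{\mathbf{B}}$ and $\lambda x_1.\cdots\lambda x_k.\top_{\mathbf{B}}$
as the least and greatest elements. We write $\bot_k$ and $\top_k$ for $\lambda x_1.\cdots\lambda x_k.\bot_{\mathbf{B}}$ and
$\lambda x_1.\cdots\lambda x_k.\top_{\mathbf{B}}$, respectively. We also write $\sqcap_k$ (resp., $\sqcup_k$) for the greatest
lower (resp., least upper) bound with respect to $\sqsubseteq_k$. We often omit $k$ and $\mathbf{B}$ and
just write $\top, \bot, \sqsubseteq, \sqcap, \sqcup$, etc. We often identify $\mathbf{B}$ and $D_0 = \mathbf{Z}^0 \to \mathbf{B}$. We write
$D_k \to D_\ell$ for the set of monotonic functions from $D_k$ to $D_\ell$.

We write $\mathit{Env}$ for the set of functions that map each integer variable to an
integer, and each $k$-ary predicate variable to an element of $D_k$. For a formula $\varphi$
(resp., a predicate $P$ and an expression $a$) and an environment $\rho \in \mathit{Env}$ such that
$\mathbf{FV}(\varphi) \subseteq \mathit{dom}(\rho)$ (resp., $\mathbf{FV}(P) \subseteq \mathit{dom}(\rho)$ and $\mathbf{FV}(a) \subseteq \mathit{dom}(\rho)$),
Fig. 1 defines the semantics $[\![\varphi]\!]\rho$ of $\varphi$ (resp., $P$ and $a$), where for a monotonic
function $F \in D_k \to D_k$, $\mathrm{LFP}^{(k)}(F) = \sqcap_k\{f \in D_k \mid f \sqsupseteq_k F(f)\}$ and
$\mathrm{GFP}^{(k)}(F) = \sqcup_k\{f \in D_k \mid f \sqsubseteq_k F(f)\}$. When $\varphi$ and $P$ are closed (i.e., do not
contain free variables), we just write $[\![\varphi]\!]$ and $[\![P]\!]$ for $[\![\varphi]\!]\emptyset$ and $[\![P]\!]\emptyset$ respectively.
By abuse of notation, we often write $\varphi \sqsubseteq \psi$ if $[\![\varphi]\!]\rho \sqsubseteq [\![\psi]\!]\rho$ for any (valid) environment
$\rho$ such that $\mathbf{FV}(\varphi) \cup \mathbf{FV}(\psi) \subseteq \mathit{dom}(\rho)$, and $\varphi \equiv \psi$ if $[\![\varphi]\!]\rho = [\![\psi]\!]\rho$; similarly
for predicates. For example, $\exists z.(x > z \wedge z > y) \equiv (x > y + 1) \equiv (x \geq y + 2)$.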
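$[\![a_1 \geq a_2]\!]\rho = \top_{\mathbf{B}}$ if $[\![a_1]\!]\rho \geq [\![a_2]\!]\rho$, and $\bot_{\mathbf{B}}$ otherwise
$[\![P(a_1, \dots, a_k)]\!]\rho = [\![P]\!]\rho([\![a_1]\!]\rho, \dots, [\![a_k]\!]\rho)$
$[\![X]\!]\rho = \rho(X)$
$[\![\varphi_1 \vee \varphi_2]\!]\rho = [\![\varphi_1]\!]\rho \sqcup [\![\varphi_2]\!]\rho$
$[\![\varphi_1 \wedge \varphi_2]\!]\rho = [\![\varphi_1]\!]\rho \sqcap [\![\varphi_2]\!]\rho$
$[\![\forall x.\varphi]\!]\rho = \sqcap_{n \in \mathbf{Z}} [\![\varphi]\!](\rho\{x \mapsto n\})$
$[\![\exists x.\varphi]\!]\rho = \sqcup_{n \in \mathbf{Z}} [\![\varphi]\!](\rho\{x \mapsto n\})$
$[\![\lambda(x_1, \dots, x_k).\varphi]\!]\rho = \lambda(n_1, \dots, n_k) \in \mathbf{Z}^k.\, [\![\varphi]\!](\rho\{x_1 \mapsto n_1, \dots, x_k \mapsto n_k\})$
$[\![\mu X^{(k)}(x_1, \dots, x_k).\varphi]\!]\rho = \mathrm{LFP}^{(k)}(\lambda f \in D_k.\lambda(n_1, \dots, n_k).\, [\![\varphi]\!](\rho\{X \mapsto f, x_1 \mapsto n_1, \dots, x_k \mapsto n_k\}))$
$[\![\nu X^{(k)}(x_1, \dots, x_k).\varphi]\!]\rho = \mathrm{GFP}^{(k)}(\lambda f \in D_k.\lambda(n_1, \dots, n_k).\, [\![\varphi]\!](\rho\{X \mapsto f, x_1 \mapsto n_1, \dots, x_k \mapsto n_k\}))$
$[\![n]\!]\rho = n$
$[\![x]\!]\rho = \rho(x)$
$[\![a_1 + a_2]\!]\rho = [\![a_1]\!]\rho + [\![a_2]\!]\rho$
$[\![a_1 - a_2]\!]\rho = [\![a_1]\!]\rho - [\![a_2]\!]\rho$


Fig. 1. The semantics of formulas.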


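Example 2. Recall formula $\mu X(x).x = 0 \vee X(x - 1)$ from Example 1. We have
$[\![\mu X(x).x = 0 \vee X(x - 1)]\!] = \mathrm{LFP}^{(1)}(F)$, with $F = \lambda f \in D_1.\lambda n \in \mathbf{Z}.(n = 0) \sqcup f(n - 1)$.
Since for any $m$, $F^m(\lambda x \in \mathbf{Z}.\bot) = \lambda n \in \mathbf{Z}.0 \leq n \leq m - 1$, we have
$\mathrm{LFP}^{(1)}(F) = \lambda n \in \mathbf{Z}.0 \leq n$ (here, $\leq$ denotes the semantic relation on integers).
In contrast, $[\![\nu X(x).x = 0 \vee X(x - 1)]\!] = \mathrm{GFP}^{(1)}(F) = \lambda n \in \mathbf{Z}.\top$.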
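
The approximation chain $F^m(\bot)$ from this example can be observed directly. The Python sketch below is only an illustration: predicates on integers are modelled as Python functions returning booleans, and the names `F` and `iterate` are chosen for this example. It cannot compute the fixpoint itself, since that is the limit of an infinite chain; it only evaluates finite approximations.

```python
def F(f):
    """One unfolding of the body: n = 0 or f(n - 1)."""
    return lambda n: n == 0 or f(n - 1)

def iterate(m):
    """Return F^m applied to the everywhere-false predicate."""
    f = lambda n: False
    for _ in range(m):
        f = F(f)
    return f

# Integers in a small window that satisfy the third approximation.
third = [n for n in range(-3, 6) if iterate(3)(n)]
```

The list `third` contains exactly 0, 1, and 2, in line with $F^m(\lambda x.\bot) = \lambda n.\,0 \leq n \leq m - 1$ for $m = 3$. No negative integer is ever added, which is why the least fixpoint is $\lambda n.\,0 \leq n$.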


2.3 Program Verification as Validity Checking of MuArith Formulas 


Various verification problems for first-order recursive programs can be reduced to 
validity of MuArith formulas. We refer the reader to [22] for a general reduction 
schema from temporal properties to MuArith formulas. However, as shown in 
this subsection, some formulas require additional handling that motivates the 
need for new transformations to be presented in Section 3. 

Consider the following functional program (written in the syntax of OCaml) 
that multiplies two numbers. 


let rec mult(x, y) = if y=0 then 0 else x + mult(x,y-1)


Then, the ternary relation $\mathit{Mult}(x, y, r)$ that expresses "mult(x, y) terminates
and returns $r$" is expressed as the following MuArith formula:


$$\mu \mathit{Mult}(x, y, r).\,(y = 0 \wedge r = 0) \vee \exists s.\,(y \neq 0 \wedge r = x + s \wedge \mathit{Mult}(x, y - 1, s)).$$
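
The correspondence between the program and the relation can be checked on small inputs. The following Python sketch is an illustration under assumptions, not part of the MuArith tool chain: `mult` mirrors the OCaml function, and `mult_rel` approximates the least fixpoint by unfolding the formula at most `fuel` times (both the function name and the `fuel` parameter are hypothetical). The existential quantifier needs no search because $r = x + s$ forces $s = r - x$.

```python
def mult(x, y):
    """Direct transcription of the recursive program (diverges for y < 0)."""
    return 0 if y == 0 else x + mult(x, y - 1)

def mult_rel(x, y, r, fuel):
    """Under-approximate Mult(x, y, r) by at most `fuel` unfoldings of the mu-formula."""
    if fuel == 0:
        return False  # the bottom element of the approximation chain
    if y == 0 and r == 0:
        return True
    # exists s. (y != 0 and r = x + s and Mult(x, y - 1, s)), with s = r - x
    return y != 0 and mult_rel(x, y - 1, r - x, fuel - 1)
```

For instance, `mult_rel(3, 4, 12, 10)` holds while `mult_rel(3, 4, 11, 10)` does not, and `mult_rel(3, -1, r, fuel)` fails for every `fuel`, reflecting that the least fixpoint contains no triple for which the program diverges.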
This lets us express a partial correctness property "if $P(x, y)$ holds and mult(x, y)
terminates and returns $r$, then $Q(x, y, r)$ holds" by: $\forall x, y, r.\,P(x, y) \wedge \mathit{Mult}(x, y, r) \Rightarrow
Q(x, y, r)$. It can further be rewritten to the following MuArith formula:


$$\forall x, y, r.\,\overline{P}(x, y) \vee \overline{\mathit{Mult}}(x, y, r) \vee Q(x, y, r), \qquad (1)$$


where $\overline{P}$ and $\overline{\mathit{Mult}}$ are respectively De Morgan duals of $P$ and $\mathit{Mult}$; $\overline{\mathit{Mult}}$ can
be expressed by:


$$\nu \overline{\mathit{Mult}}(x, y, r).\,(y \neq 0 \vee r \neq 0) \wedge \forall s.\,(y = 0 \vee r \neq x + s \vee \overline{\mathit{Mult}}(x, y - 1, s)).$$


The total correctness "if $P(x, y)$, then mult(x, y) terminates and returns $r$
such that $Q(x, y, r)$" can be expressed by: $\forall x, y.\,P(x, y) \Rightarrow \exists r.\,\mathit{Mult}(x, y, r) \wedge
Q(x, y, r)$, which is equivalent to the MuArith formula:


$$\forall x, y.\,\overline{P}(x, y) \vee (\exists r.\,\mathit{Mult}(x, y, r) \wedge Q(x, y, r))$$


As a special case, the termination property "if $y \geq 0$ then mult(x, y) terminates"
can be expressed by:


$$\forall x, y.\,y < 0 \vee \exists r.\,\mathit{Mult}(x, y, r). \qquad (2)$$


We can also express relational properties of programs such as the equivalence 
of two programs. Let us consider another implementation of multiplication: 


let mult2(x,y) =
  let rec multacc(x,y,a) = if y=0 then a else multacc(x,y-1,x+a)
  in multacc(x,y,0)


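Then the predicate $\mathit{Multacc}(x, y, a, r)$, which represents "multacc(x, y, a) terminates
and returns $r$", can be expressed by:


$$\mu \mathit{Multacc}(x, y, a, r).\,(y = 0 \wedge r = a) \vee (y \neq 0 \wedge \mathit{Multacc}(x, y - 1, x + a, r)).$$


Thus, the equivalence of mult and mult2 can be expressed by: $\forall x, y, r.\,\mathit{Mult}(x, y, r)
\Leftrightarrow \mathit{Multacc}(x, y, 0, r)$, which can be expressed by the conjunction of the MuArith
formulas:


$$\forall x, y, r.\,\overline{\mathit{Mult}}(x, y, r) \vee \mathit{Multacc}(x, y, 0, r) \qquad (3)$$
$$\forall x, y, r.\,\mathit{Mult}(x, y, r) \vee \overline{\mathit{Multacc}}(x, y, 0, r) \qquad (4)$$


where $\overline{\mathit{Multacc}}$ is the De Morgan dual of $\mathit{Multacc}$, defined analogously to $\overline{\mathit{Mult}}$.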
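
Before attempting a proof of (3) and (4), the claimed equivalence can be sanity-checked by testing. The Python sketch below transcribes both programs (the transcription and the chosen input ranges are assumptions of this illustration) and compares them on a small grid of inputs with $y \geq 0$. Such testing supports the claim on the sampled inputs only and is no substitute for the validity proof.

```python
def mult(x, y):
    """Recursive multiplication by repeated addition."""
    return 0 if y == 0 else x + mult(x, y - 1)

def mult2(x, y):
    """Accumulator-passing variant of the same computation."""
    def multacc(x, y, a):
        return a if y == 0 else multacc(x, y - 1, x + a)
    return multacc(x, y, 0)

# Compare both implementations (and the built-in product) on a small grid.
agree = all(mult(x, y) == mult2(x, y) == x * y
            for x in range(-3, 4) for y in range(0, 6))
```

Both functions diverge for negative `y`, so the grid is restricted to non-negative values of `y`; on that grid `agree` is `True`.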


Motivation. Kobayashi et al. [22] presented a method for proving the validity of
MuArith formulas. It can prove formula (1) valid: since there are neither $\mu$ nor $\exists$,
it is reducible to the problem of satisfiability of CHC [4]. However, the method
is not powerful enough on formulas (2) and (3) for termination and program
equivalence, respectively. It first tries to eliminate existential quantifiers and
$\mu$-formulas, so that the resulting formula can be reduced to the satisfiability of
CHC. But it fails when the witness of an existential quantifier (i.e., $r$ such that
$\exists r.\cdots$) is not bounded by a linear expression, e.g., the witness for $\exists r$ is a non-linear
expression $x \times y$ in the case of (2). This is unfortunate, as methods specialized
on proving program termination, e.g. [18], can easily prove the termination of
program mult. Thus, in order to exploit the advantage of the uniform approach
to program verification based on MuArith, we need to strengthen the method
for proving MuArith formulas.


2.4 Auxiliary Definitions 


We introduce additional definitions on formulas, which will be used later in
our formalization of fold/unfold-like transformations. A $(k, \ell)$-context (or, just a
context) is an expression obtained from an $\ell$-ary predicate by replacing a $k$-ary
predicate variable with $[\,]$ (in other words, a context is a predicate that may contain
$[\,]$ as a special predicate variable). For a context $C$ and a predicate $P$ (that
does not contain free occurrences of variables bound in $C$), we write $C[P]$ for the
predicate obtained by replacing $[\,]$ with $P$. For example, $C \triangleq \lambda(x, y).\exists z.[\,](x, z, y)$
is a $(3, 2)$-context, and $C[\lambda(x, y, z).(x > y \wedge y > z)]$ is $\lambda(x, y).\exists z.(\lambda(x, y, z).(x >
y \wedge y > z))(x, z, y)$, which is equivalent to $\lambda(x, y).\exists z.x > z \wedge z > y$.

For a function $F \in D_k \to D_\ell$, we say that $F$ is continuous if it preserves
the least upper bound, i.e., $F(\sqcup_{f \in S} f) = \sqcup_{f \in S} F(f)$ for any (possibly
infinite) set $S \subseteq D_k$. Similarly, we say that $F$ is co-continuous if it preserves
the greatest lower bound, i.e., $F(\sqcap_{f \in S} f) = \sqcap_{f \in S} F(f)$. For example,
$\lambda f.f \wedge \varphi \in D_0 \to D_0$ and $\lambda f.f \vee \varphi$ are both continuous and co-continuous for
any $\varphi \in D_0$. In contrast, $\lambda f.\exists x.f(x) \in D_1 \to D_0$ is continuous but not
co-continuous;$^4$ $\lambda f.\forall x.f(x) \in D_1 \to D_0$ is co-continuous but not continuous. We
say that a context $C$ is continuous if its semantics, i.e., $\lambda f.[\![C[X]]\!]\{X \mapsto f\}$ is;
analogously for co-continuity.

The following lemma (which follows immediately from the definition) pro- 
vides a syntactic condition that is sufficient for the co-continuity of a context. 


Lemma 1. Let C be a (k,0)-context. If C can be generated by the following 
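syntax, then C is co-continuous.⁵


\[ C ::= [\,] \mid \lambda(x_1, \ldots, x_k).C \mid C(a_1, \ldots, a_k) \mid C \land \varphi \mid \varphi \land C \mid C \lor \varphi \mid \varphi \lor C \mid \forall x.C \]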


Remark 1. The syntax and semantics of MuArith was defined based on hierar- 
chical fixpoint equations (HES) in [22]. The above semantics is equivalent to that 
of [22], modulo the standard conversions between fixpoint formulas and HES. 


⁴ In fact, let $F = \lambda f.\exists x.f(x) \in D_1 \to D_0$ and
$S = \{\lambda x.x > n \mid n \in \mathbb{Z}\}$. Then $F(f) = \top$ for any
$f \in S$, but $F(\bigsqcap_{f \in S} f) = F(\lambda x.\bot) = \bot$.

⁵ Here, for the sake of simplicity, we mix the syntax of contexts that yield predicates
and propositions.


3 Fold/Unfold-Like Transformations


In this section, we present new fold/unfold-like transformations for MuArith, 
to enhance the power of MuArith validity checkers. We first informally review 
fold/unfold transformations for logic programming and explain what kind of 
transformation we wish to apply to MuArith formulas in Section 3.1. We then 
prove theorems that justify such transformations in Sections 3.2 and 3.3. 


3.1 Overview of Transformations for MuArith 


Revisiting Fold/Unfold Transformations for Logic Programming. The
original concept [32] is presented in the following example, where each recurrence
is represented by a CHC (i.e., an implication involving uninterpreted predicates
Even and Odd).

\[ \mathit{Even}(x) \Leftarrow x = 0 \qquad \mathit{Even}(x) \Leftarrow x > 0, \mathit{Even}(x - 2) \]
\[ \mathit{Odd}(x) \Leftarrow x = 1 \qquad \mathit{Odd}(x) \Leftarrow x > 0, \mathit{Odd}(x - 2) \]

We wish to prove that $\bot \Leftarrow \mathit{Even}(x), \mathit{Odd}(x)$. Many
of the existing CHC solvers, such as HoIce [9] and Z3 [24], fail to prove it as
they do not handle the divisibility constraints well. After defining a new
predicate EvenOdd as $\mathit{EvenOdd}(x) \Leftarrow \mathit{Even}(x),
\mathit{Odd}(x)$ and unfolding Even, we obtain the following new CHCs.

\[ \mathit{EvenOdd}(x) \Leftarrow x = 0, \mathit{Odd}(x) \qquad \mathit{EvenOdd}(x) \Leftarrow x > 0, \mathit{Even}(x - 2), \mathit{Odd}(x) \]

By unfolding $\mathit{Odd}(x)$ in the first CHC, its body becomes inconsistent.
By unfolding $\mathit{Odd}(x)$ in the second CHC, we obtain the following new
CHCs.

\[ \mathit{EvenOdd}(x) \Leftarrow x > 0, \mathit{Even}(x - 2), x = 1 \]
\[ \mathit{EvenOdd}(x) \Leftarrow x > 0, \mathit{Even}(x - 2), \mathit{Odd}(x - 2) \]

By unfolding $\mathit{Even}(x - 2)$, the body of the first CHC becomes
inconsistent. Now, the part "$\mathit{Odd}(x - 2), \mathit{Even}(x - 2)$" in the
second CHC matches the definition of EvenOdd, so we can "fold" it and obtain the
following new CHC.

\[ \mathit{EvenOdd}(x) \Leftarrow x > 0, \mathit{EvenOdd}(x - 2) \]

The least solution for EvenOdd is $\lambda x.\bot$, hence we have now obtained
$\bot \Leftarrow \mathit{Even}(x), \mathit{Odd}(x)$ without synthesizing
interpretations of Even and Odd over the divisibility constraints.
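
The conclusion of this example can be sanity-checked on a bounded range. The following Java sketch is only an illustration and not part of any solver: the class and method names are hypothetical, and the recursive methods are a direct reading of the CHCs above as least fixpoints (they terminate because the argument decreases while it is positive).

```java
// Bounded sanity check of the Even/Odd example (a sketch, not a proof).
final class EvenOddCheck {
    // Least-fixpoint reading of: Even(x) <= x = 0;  Even(x) <= x > 0, Even(x - 2).
    static boolean even(int x) {
        return x == 0 || (x > 0 && even(x - 2));
    }

    // Least-fixpoint reading of: Odd(x) <= x = 1;  Odd(x) <= x > 0, Odd(x - 2).
    static boolean odd(int x) {
        return x == 1 || (x > 0 && odd(x - 2));
    }

    // Folded clause EvenOdd(x) <= x > 0, EvenOdd(x - 2): there is no base case,
    // so the recursion can only end in "false" once x <= 0.
    static boolean evenOdd(int x) {
        return x > 0 && evenOdd(x - 2);
    }

    // True iff EvenOdd(x) agrees with Even(x) /\ Odd(x) for every x in [lo, hi].
    static boolean agreesOn(int lo, int hi) {
        for (int x = lo; x <= hi; x++) {
            if ((even(x) && odd(x)) != evenOdd(x)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesOn(-100, 100)); // prints "true"
    }
}
```

Such a check only covers finitely many values; the point of the fold/unfold argument is that it establishes the property for all integers.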


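Transformations for MuArith. The above example can be reformulated in
MuArith. Predicates Even and Odd are expressed as follows.

\[ \mu \mathit{Even}(x).x = 0 \lor (x > 0 \land \mathit{Even}(x - 2)) \qquad (5) \]
\[ \mu \mathit{Odd}(x).x = 1 \lor (x > 0 \land \mathit{Odd}(x - 2)) \qquad (6) \]

We wish to prove that $\mathit{Even}(x) \land \mathit{Odd}(x)$ is inconsistent,
i.e. $\forall x.\overline{\mathit{Even}}(x) \lor \overline{\mathit{Odd}}(x)$ is
valid, where $\overline{\mathit{Even}}$ and $\overline{\mathit{Odd}}$ are:

\[ \nu \overline{\mathit{Even}}(x).x \neq 0 \land (x \leq 0 \lor \overline{\mathit{Even}}(x - 2)) \qquad (7) \]
\[ \nu \overline{\mathit{Odd}}(x).x \neq 1 \land (x \leq 0 \lor \overline{\mathit{Odd}}(x - 2)) \qquad (8) \]

Now, let $Y(x) \triangleq \overline{\mathit{Even}}(x) \lor
\overline{\mathit{Odd}}(x)$, which can be rewritten as follows.

\begin{align*}
Y(x) &\equiv (x \neq 0 \land (x \leq 0 \lor \overline{\mathit{Even}}(x - 2))) \lor (x \neq 1 \land (x \leq 0 \lor \overline{\mathit{Odd}}(x - 2))) \\
&\equiv (x \leq 0 \lor x \neq 1 \lor \overline{\mathit{Even}}(x - 2)) \land (x \leq 0 \lor \overline{\mathit{Even}}(x - 2) \lor \overline{\mathit{Odd}}(x - 2)) \\
&\equiv x \leq 0 \lor \overline{\mathit{Even}}(x - 2) \lor \overline{\mathit{Odd}}(x - 2) \equiv x \leq 0 \lor Y(x - 2)
\end{align*}

Based on this, we wish to replace $Y$ with $\nu Y(x).x \leq 0 \lor Y(x - 2)$;
then the validity of $\forall x.Y(x)$ would follow immediately. As we will see
later in Section 3.3, this transformation is indeed sound.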

Intuitively, the above transformation works as follows. Given a formula $C[X]$,
which contains a fixpoint formula $X$ defined by the equation $X = D[X]$,
introduce a new predicate $Y$ such that $Y = C[X]$. Then, unfold $X$ to $D[X]$
and obtain $Y = C[D[X]]$. Then, rewrite $C[D[X]]$ to a formula of the form
$E[C[X]]$. By "folding" $C[X]$, we obtain $Y = E[Y]$, which serves as a new
definition clause for $Y$. We wish to apply this kind of transformation not only
to $\nu$-only formulas like above, but also to formulas involving $\mu$ and
quantifiers, as discussed below.


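Recall formula (2) from Section 2.3. Let $X(x, y) \triangleq \exists
r.\mathit{Mult}(x, y, r)$. Then,

\begin{align*}
X(x, y) &\equiv \exists r.((y = 0 \land r = 0) \lor \exists s.(y \neq 0 \land r = x + s \land \mathit{Mult}(x, y - 1, s))) \\
&\equiv y = 0 \lor (y \neq 0 \land \exists s.\mathit{Mult}(x, y - 1, s)) \\
&\equiv y = 0 \lor (y \neq 0 \land X(x, y - 1))
\end{align*}

As justified later in Section 3.2, we can then replace $X$ with $\mu X(x, y).y =
0 \lor (y \neq 0 \land X(x, y - 1))$. We are then left with formula $\forall x,
y.y < 0 \lor X(x, y)$, which can
then be proved valid by Mu2CHC [22], the existing MuArith validity checker.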
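
To make the effect of this fold concrete, the sketch below compares, on a bounded range, the existence of a witness $r$ for Mult with the folded predicate $X$. It is an illustration under stated assumptions: the class, method names, and the `fuel` bound on the number of unfoldings are hypothetical devices introduced here so that negative $y$ (for which the least fixpoint is false) does not recurse forever.

```java
import java.util.OptionalLong;

// Bounded comparison of "exists r. Mult(x, y, r)" with its folded form (a sketch).
final class MultFoldCheck {
    // Follows the unfolding
    //   Mult(x, y, r) = (y = 0 /\ r = 0) \/ exists s. (y != 0 /\ r = x + s /\ Mult(x, y - 1, s))
    // and returns the witness r if one is found within `fuel` unfoldings.
    static OptionalLong multWitness(long x, long y, int fuel) {
        if (y == 0) {
            return OptionalLong.of(0);
        }
        if (fuel == 0) {
            return OptionalLong.empty();
        }
        OptionalLong s = multWitness(x, y - 1, fuel - 1);
        return s.isPresent() ? OptionalLong.of(x + s.getAsLong()) : OptionalLong.empty();
    }

    // Folded predicate X(x, y) = y = 0 \/ (y != 0 /\ X(x, y - 1)), with the same bound.
    static boolean folded(long x, long y, int fuel) {
        if (y == 0) {
            return true;
        }
        return fuel > 0 && folded(x, y - 1, fuel - 1);
    }

    // True iff, for all x, y in [-bound, bound], a witness exists exactly when the
    // folded predicate holds, and every witness found equals x * y.
    static boolean agreesOn(int bound) {
        int fuel = 2 * bound + 1;
        for (long x = -bound; x <= bound; x++) {
            for (long y = -bound; y <= bound; y++) {
                OptionalLong r = multWitness(x, y, fuel);
                if (r.isPresent() != folded(x, y, fuel)) {
                    return false;
                }
                if (r.isPresent() && r.getAsLong() != x * y) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesOn(20)); // prints "true"
    }
}
```

The witness is the non-linear value $x \times y$, while the folded predicate no longer mentions $r$ at all, which is why the existential quantifier stops being an obstacle.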

Let us also recall a generalized version of formula (3):

\[ \forall x, y, a, r.\overline{\mathit{Mult}}(x, y, r) \lor \mathit{Multacc}(x, y, a, r + a), \]

which contains $\mu$ and $\nu$. Let $Y(x, y, a, r) \triangleq
\overline{\mathit{Mult}}(x, y, r) \lor \mathit{Multacc}(x, y, a, r + a)$.
Then, we have:

\begin{align*}
Y(x, y, a, r) &\equiv ((y \neq 0 \lor r \neq 0) \land \forall s.(y = 0 \lor r \neq x + s \lor \overline{\mathit{Mult}}(x, y - 1, s))) \\
&\qquad \lor (y = 0 \land r + a = a) \lor (y \neq 0 \land \mathit{Multacc}(x, y - 1, x + a, r + a)) \\
&\equiv (y = 0 \Rightarrow r \neq 0 \lor r + a = a) \\
&\qquad \land (y \neq 0 \Rightarrow (\overline{\mathit{Mult}}(x, y - 1, r - x) \lor \mathit{Multacc}(x, y - 1, x + a, r + a))) \\
&\equiv y \neq 0 \Rightarrow Y(x, y - 1, x + a, r - x)
\end{align*}

As justified in Section 3.3, we can replace $Y$ with $\nu Y(x, y, a, r).(y = 0
\lor Y(x, y - 1, x + a, r - x))$, giving us $\forall x, y, a, r.Y(x, y, a, r)$
immediately.

Although the above transformations are sound, the soundness of fold/unfold
transformations for MuArith is delicate in general. For example, consider formula
$\exists x.x > y \land X(x, y)$, where:

\[ X \triangleq \nu X(x, y).x \geq y + 1 \land X(x, y + 1). \]

It is obviously false since there exists no $x$ that satisfies $x > y \land x
\geq y + 1 \land x \geq y + 2 \land \cdots \equiv \forall n \geq 0.x > y + n$.
Let $Y(y) \triangleq \exists x.x > y \land X(x, y)$. Then,

\begin{align*}
Y(y) &\equiv \exists x.(x > y \land x \geq y + 1 \land X(x, y + 1)) \\
&\equiv \exists x.(x > y + 1 \land X(x, y + 1)) \equiv Y(y + 1).
\end{align*}

Based on this, one may be tempted to replace $Y$ with $\nu Y(y).Y(y + 1) \equiv
\lambda y.\top$, but that is obviously wrong.

In the next two subsections, we present theorems that justify all the
transformations above except the last (invalid) one.


3.2 Transformations for $\mu$-Formulas


In this subsection, we prove a theorem that enables the replacement of a
predicate of the form $C[\mu X.D[X]]$ with one of the form $\mu Y.E[Y]$, and
apply it to justify the transformation for $\exists r.\mathit{Mult}(x, y, r)$
discussed in the previous subsection. The corresponding transformation for
$\nu$-formulas is discussed in the next subsection. The theorem is stated as
follows.


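Theorem 1. Let $C$, $D$ and $E$ be $(k, \ell)$-, $(k, k)$-, and
$(\ell, \ell)$-contexts respectively. If $C[D[X]] \sqsupseteq_\ell E[C[X]]$
holds for any $k$-ary predicate $X$, then we have:

\[ C[\mu X(x_1, \ldots, x_k).D[X](x_1, \ldots, x_k)] \sqsupseteq_\ell \mu Y(y_1, \ldots, y_\ell).E[Y](y_1, \ldots, y_\ell). \]


The theorem follows easily from the definition of the semantics of the least
fixpoint operator.


Proof. Suppose $C[D[X]] \sqsupseteq E[C[X]]$. Then, we have

\[ C[\mu X(\tilde{x}).D[X](\tilde{x})] = C[D[\mu X(\tilde{x}).D[X](\tilde{x})]] \sqsupseteq E[C[\mu X(\tilde{x}).D[X](\tilde{x})]]. \]

Since $\mu Y(\tilde{y}).E[Y](\tilde{y})$ is the least predicate $Y$ such that
$Y \sqsupseteq E[Y]$, we have $C[\mu X(\tilde{x}).D[X](\tilde{x})] \sqsupseteq
\mu Y(\tilde{y}).E[Y](\tilde{y})$ as required.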


To see how the theorem above enables fold/unfold-like transformations, suppose
that we wish to prove a formula of the form $Y =
C[\mu X(\tilde{x}).D[X](\tilde{x})]$. It suffices to prove
$C[D[\mu X(\tilde{x}).D[X](\tilde{x})]]$, obtained by unfolding $X$. If the
assumption $C[D[X]] \sqsupseteq E[C[X]]$ holds, we can change the goal to
$E[C[\mu X(\tilde{x}).D[X](\tilde{x})]]$. Thus, by the theorem, it suffices to
prove $\mu Y(\tilde{y}).E[Y](\tilde{y})$, which is obtained by "folding"
$C[\mu X(\tilde{x}).D[X](\tilde{x})]$ to $Y$. Note that the theorem guarantees
only that the transformation provides an underapproximation of the original
predicate. A stronger condition is required for the equivalence; see
Corollary 1 given later. Note also that finding an appropriate context $E$ may
not be easy in general; we discuss how to mechanically find $E$ in Section 5.

Example 3. Recall again formula (2) from Section 2.3. Let us define $C$, $D$,
and $E$ by:

\begin{align*}
C &\triangleq \lambda(x, y).\exists r.[\,](x, y, r) \\
E &\triangleq \lambda(x, y).y = 0 \lor (y \neq 0 \land [\,](x, y - 1)) \\
D &\triangleq \lambda(x, y, r).(y = 0 \land r = 0) \lor \exists s.(y \neq 0 \land r = x + s \land [\,](x, y - 1, s)).
\end{align*}

Then, for any ternary predicate $X$, we have:

\begin{align*}
C[D[X]] &\equiv \lambda(x, y).\exists r.(y = 0 \land r = 0) \lor \exists s.(y \neq 0 \land r = x + s \land X(x, y - 1, s)) \\
&\equiv \lambda(x, y).y = 0 \lor \exists r, s.(y \neq 0 \land r = x + s \land X(x, y - 1, s)) \\
&\equiv \lambda(x, y).y = 0 \lor (y \neq 0 \land \exists s.X(x, y - 1, s)) \equiv E[C[X]].
\end{align*}

By Theorem 1, we have $C[D[\mathit{Mult}]] \sqsupseteq \mu Y(x, y).y = 0 \lor
(y \neq 0 \land Y(x, y - 1))$. Thus, the goal $\forall x, y.y < 0 \lor \exists
r.\mathit{Mult}(x, y, r)$ has been reduced to:

\[ \forall x, y.y < 0 \lor (\mu Y(x, y).y = 0 \lor (y \neq 0 \land Y(x, y - 1)))(x, y), \]

which can be proved valid by Mu2CHC.


3.3 Fold/Unfold for $\nu$-Formulas


We now prove a theorem that allows us to replace a predicate of the form
$C[\nu X.D[X]]$ with one of the form $\nu Y.E[Y]$. It is similar to Theorem 1,
but requires more conditions. Recall Lemma 1, which provides a sufficient
syntactic condition for co-continuity.


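Theorem 2. Let $C$, $D$ and $E$ be $(k, \ell)$-, $(k, k)$-, and
$(\ell, \ell)$-contexts respectively. Suppose that the following conditions
hold: (i) $C[\top^{(k)}] \sqsupseteq_\ell \top^{(\ell)}$, (ii) $C[D[X]]
\sqsupseteq_\ell E[C[X]]$, and (iii) $C$ is co-continuous. Then

\[ C[\nu X(x_1, \ldots, x_k).D[X](x_1, \ldots, x_k)] \sqsupseteq_\ell \nu Y(y_1, \ldots, y_\ell).E[Y](y_1, \ldots, y_\ell). \]


Proof. For $F \in D_k \to D_k$ and an ordinal $\gamma$, we define
$F^\gamma(\top^{(k)})$ inductively by: $F^0(\top^{(k)}) = \top^{(k)}$,
$F^{\gamma+1}(\top^{(k)}) = F(F^\gamma(\top^{(k)}))$, and $F^\gamma(\top^{(k)})
= \bigsqcap_{\gamma' < \gamma} F^{\gamma'}(\top^{(k)})$ if $\gamma$ is a limit
ordinal. By abuse of notation, we write $D^\gamma[\top^{(k)}]$ for
$[\![D]\!]^\gamma(\top^{(k)})$ if $D$ is a $(k, k)$-context. Since there exists
an ordinal $\gamma$ such that $\nu X.D[X] = D^\gamma[\top^{(k)}]$ and $\nu
Y.E[Y] = E^\gamma[\top^{(\ell)}]$, it suffices to show that
$C[D^\gamma[\top^{(k)}]] \sqsupseteq_\ell E^\gamma[\top^{(\ell)}]$ holds for any
ordinal $\gamma$, by transfinite induction on $\gamma$. The base case where
$\gamma = 0$ follows immediately from the first condition. If $\gamma$ is a
successor ordinal $\gamma' + 1$, then

\[ C[D^\gamma[\top^{(k)}]] \sqsupseteq E[C[D^{\gamma'}[\top^{(k)}]]] \sqsupseteq E[E^{\gamma'}[\top^{(\ell)}]] = E^\gamma[\top^{(\ell)}]. \]

Here, we have used the induction hypothesis in the second inequality. If
$\gamma$ is a limit ordinal, then we have:

\[ C[D^\gamma[\top^{(k)}]] = C[\bigsqcap_{\gamma' < \gamma} D^{\gamma'}[\top^{(k)}]] = \bigsqcap_{\gamma' < \gamma} C[D^{\gamma'}[\top^{(k)}]] \sqsupseteq \bigsqcap_{\gamma' < \gamma} E^{\gamma'}[\top^{(\ell)}] = E^\gamma[\top^{(\ell)}]. \]

Here we have used the co-continuity of $C$ in the second step. We have thus
proved that $C[D^\gamma[\top^{(k)}]] \sqsupseteq_\ell E^\gamma[\top^{(\ell)}]$
holds for any ordinal $\gamma$. We, therefore, have $C[\nu
X(\tilde{x}).D[X](\tilde{x})] \sqsupseteq \nu Y(\tilde{y}).E[Y](\tilde{y})$ as
required.

Example 4. Recall the formula $\forall x, y, a, r.\overline{\mathit{Mult}}(x,
y, r) \lor \mathit{Multacc}(x, y, a, r + a)$ discussed in Section 3.1. Let us
define $C$, $D$, $E$ by:

\begin{align*}
C &\triangleq \lambda(x, y, a, r).[\,](x, y, r) \lor \mathit{Multacc}(x, y, a, r + a) \\
D &\triangleq \lambda(x, y, r).((y \neq 0 \lor r \neq 0) \land \forall s.(y = 0 \lor r \neq x + s \lor [\,](x, y - 1, s))) \\
E &\triangleq \lambda(x, y, a, r).y = 0 \lor [\,](x, y - 1, x + a, r - x)
\end{align*}

They satisfy all the three conditions of Theorem 2. In particular, for any
ternary predicate $X$, we have

\begin{align*}
C[D[X]] &\equiv \lambda(x, y, a, r).((y \neq 0 \lor r \neq 0) \land \forall s.(y = 0 \lor r \neq x + s \lor X(x, y - 1, s))) \\
&\qquad \lor \mathit{Multacc}(x, y, a, r + a) \\
&\equiv \lambda(x, y, a, r).((y \neq 0 \lor r \neq 0) \land \forall s.(y = 0 \lor r \neq x + s \lor X(x, y - 1, s))) \\
&\qquad \lor (y = 0 \land r + a = a) \lor (y \neq 0 \land \mathit{Multacc}(x, y - 1, x + a, r + a)) \\
&\equiv \lambda(x, y, a, r).y = 0 \lor X(x, y - 1, r - x) \lor \mathit{Multacc}(x, y - 1, x + a, r + a) \\
&\equiv E[C[X]],
\end{align*}

based on the corresponding transformations shown in Section 3.1. We have thus
$\forall x, y, a, r.\overline{\mathit{Mult}}(x, y, r) \lor \mathit{Multacc}(x,
y, a, r + a) \sqsupseteq \forall x, y, a, r.(\nu Y(x, y, a, r).y = 0 \lor Y(x,
y - 1, x + a, r - x))(x, y, a, r)$, and the righthand side can be proved to
be valid by Mu2CHC.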


Note that Theorems 1 and 2 guarantee the soundness of the replacement
of $C[\alpha X(x_1, \ldots, x_k).D[X](x_1, \ldots, x_k)]$ with $\alpha Y(y_1,
\ldots, y_\ell).E[Y](y_1, \ldots, y_\ell)$ (for $\alpha \in \{\mu, \nu\}$), but
not completeness: the validity of $C[\alpha X(x_1, \ldots, x_k).D[X](x_1,
\ldots, x_k)]$ does not necessarily imply that of $\alpha Y(y_1, \ldots,
y_\ell).E[Y](y_1, \ldots, y_\ell)$. Actually, by combining Theorem 1 and the
dual version of Theorem 2, we obtain the following corollary, which guarantees
completeness under a stronger condition.


Corollary 1. Let $C$, $D$ and $E$ be $(k, \ell)$-, $(k, k)$-, and
$(\ell, \ell)$-contexts respectively. Suppose that the following conditions
hold: (i) $C[\bot^{(k)}] \sqsubseteq_\ell \bot^{(\ell)}$, (ii) $C[D[X]]
\equiv_\ell E[C[X]]$, and (iii) $C$ is continuous. Then $C[\mu X(x_1, \ldots,
x_k).D[X](x_1, \ldots, x_k)] \equiv_\ell \mu Y(y_1, \ldots, y_\ell).E[Y](y_1,
\ldots, y_\ell)$.


4 Further Examples 


In this section, we give more examples to demonstrate the utility of our
transformations for relational/temporal property verification of recursive
programs.


4.1 Relational Reasoning on Recursive Programs


Below we discuss an example which is beyond the reach of state-of-the-art CHC
solvers (see e.g., [33], the end of Section 5).

Example 5. Consider the goal $\forall x, y, z, r.(\mathit{Mult}(x + y, z, r)
\Rightarrow \exists s, t.\mathit{Mult}(x, z, s) \land \mathit{Mult}(y, z, t)
\land r = s + t)$, which is equivalent to:

\[ \forall x, y, z, r.(\overline{\mathit{Mult}}(x + y, z, r) \lor \exists s, t.(\mathit{Mult}(x, z, s) \land \mathit{Mult}(y, z, t) \land r = s + t)), \]

where $\mathit{Mult}$ and $\overline{\mathit{Mult}}$ are as given in
Section 2.3. The following contexts $C$, $D$, and $E$ satisfy the three
conditions of Theorem 2.

\begin{align*}
C &\triangleq \lambda(x, y, z, r).[\,](x + y, z, r) \lor \exists s, t.(\mathit{Mult}(x, z, s) \land \mathit{Mult}(y, z, t) \land r = s + t) \\
D &\triangleq \lambda(x, z, r).(z \neq 0 \lor r \neq 0) \land (z = 0 \lor [\,](x, z - 1, r - x)) \\
E &\triangleq \lambda(x, y, z, r).(z = 0 \lor (z \neq 0 \land [\,](x, y, z - 1, r - x - y))).
\end{align*}

By Theorem 2, we have $C[\overline{\mathit{Mult}}] \sqsupseteq \nu Y(x, y, z,
r).E[Y](x, y, z, r) \equiv \lambda(x, y, z, r).\top$. We have thus proved that
$\forall x, y, z, r.C[\overline{\mathit{Mult}}](x, y, z, r)$ (i.e., $\forall x,
y, z, r.(\mathit{Mult}(x + y, z, r) \Rightarrow \exists s, t.\mathit{Mult}(x,
z, s) \land \mathit{Mult}(y, z, t) \land r = s + t)$) is valid.


4.2 Proving Temporal Properties 


Here we give an example of proving a liveness property of a recursive program by 
using our transformation. The example is a variation of the example discussed 
in [22], but it cannot be handled by their method for proving MuArith formulas. 


Example 6. Consider the following OCaml program: 


let rec sum n = if n=0 then 0 else n+sum(n-1)
let rec loop x = if x=0 then () else loop (x-1)
let rec repeat n = let x = sum n in loop x; repeat(n+1)
let main() = repeat 0


Suppose that we wish to prove that the function repeat is called infinitely often. 
The reduction from linear-time temporal property verification to MuArith yields 
the problem of determining the validity of Repeat(0), where: 


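\begin{align*}
\mathit{Repeat} &\triangleq \nu \mathit{Repeat}(n).(\exists x.\mathit{Sum}(n, x)) \land (\forall x.\overline{\mathit{Sum}}(n, x) \lor \mathit{Loop}(x)) \land \mathit{Repeat}(n + 1) \\
\mathit{Sum} &\triangleq \mu \mathit{Sum}(n, x).(n = 0 \land x = 0) \lor (n \neq 0 \land \exists r.\mathit{Sum}(n - 1, r) \land x = n + r) \\
\mathit{Loop} &\triangleq \mu \mathit{Loop}(x).x = 0 \lor (x \neq 0 \land \mathit{Loop}(x - 1)).
\end{align*}

Here, $\overline{\mathit{Sum}}$ is the De Morgan dual of $\mathit{Sum}$. The
validity of this formula cannot be proved by Mu2CHC due to the existential
quantifier. Note that Mu2CHC replaces each existential quantifier $\exists
x.\varphi$ with a bounded quantifier $\exists x < a.\varphi$, and $a$ must be a
linear expression. In the example above, $x$ is not linearly bounded by $n$. To
remove the existential quantifier, let

\begin{align*}
C &\triangleq \lambda n.\exists x.[\,](n, x) \\
E &\triangleq \lambda n.n = 0 \lor (n \neq 0 \land [\,](n - 1)) \\
D &\triangleq \lambda(n, x).(n = 0 \land x = 0) \lor (n \neq 0 \land \exists r.[\,](n - 1, r) \land x = n + r).
\end{align*}


Algorithm 1: Fold/unfold for disjunction

Input: Formula $\Phi$ of the form $X(f(x, y)) \lor Y(g(x, y))$, where $X$ and $Y$ are
       predicates defined by $\alpha_X, \alpha_Y \in \{\mu, \nu\}$.
Output: A formula $\Phi'$ such that $\Phi \sqsupseteq \Phi'$.
1  $\alpha \leftarrow$ if $\nu \in \{\alpha_X, \alpha_Y\}$ then $\nu$ else $\mu$;
2  $\bigwedge \psi_i \leftarrow \mathit{cnf}(\mathit{unfold}(\Phi))$;
3  for each $\psi_i$ do
4      if $\psi_i$ has the form $X(s_1) \lor Y(s_2) \lor \psi_i'$, $f(t_1, t_2) = s_1$ and $g(t_1, t_2) = s_2$ then
5          $\psi_i \leftarrow Z(t_1, t_2) \lor \psi_i'$;
6  return $\alpha Z(x, y).\bigwedge \psi_i$;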


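Since $C[D[X]] \sqsupseteq E[C[X]]$ holds, we can apply Theorem 1 to
underapproximate $\exists x.\mathit{Sum}(n, x)$ by $\mu X(n).n = 0 \lor (n \neq
0 \land X(n - 1))$. Therefore, the goal has been reduced to
$\mathit{Repeat}'(0)$ where

\begin{align*}
\mathit{Repeat}' &\triangleq \nu \mathit{Repeat}'(n).X(n) \land (\forall x.\overline{\mathit{Sum}}(n, x) \lor \mathit{Loop}(x)) \land \mathit{Repeat}'(n + 1) \\
X &\triangleq \mu X(n).n = 0 \lor (n \neq 0 \land X(n - 1)),
\end{align*}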


which can be proved valid by Mu2CHC automatically. 
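
The remark that $x$ is not linearly bounded by $n$ can be illustrated numerically. The following is a minimal sketch, assuming a Java counterpart of the OCaml function `sum`; the class name and the helper `firstExceeding` are hypothetical and serve only to show that, for a sample linear bound $a \cdot n + b$, the witness eventually exceeds it.

```java
// Illustration of why the witness of "exists x. Sum(n, x)" has no linear bound (a sketch).
final class SumWitness {
    // Java counterpart of the OCaml `sum`; long is used so that this sketch does not overflow.
    static long sum(long n) {
        return n == 0 ? 0 : n + sum(n - 1);
    }

    // Smallest n in [0, limit] with sum(n) > a * n + b, or -1 if there is none.
    static long firstExceeding(long a, long b, long limit) {
        for (long n = 0; n <= limit; n++) {
            if (sum(n) > a * n + b) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // sum(n) = n(n+1)/2 first exceeds 10n + 10 at n = 21 (231 > 220).
        System.out.println(firstExceeding(10, 10, 1000)); // prints "21"
    }
}
```

After the fold, the reduced predicate $X(n)$ only asks whether the recursion reaches $n = 0$, so no bound on the witness is needed.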


5 Algorithm and Evaluation 


In this section, we first present an algorithm for our transformation and then 
outline its implementation and report on experimental results. 


5.1 Algorithm 


Theorems 1 and 2 given in Section 3 state sufficient conditions for our fold/unfold 
transformation to be sound. In this subsection, we discuss how to systematically 
apply the theorems and how to find a context E. 

To make it easy to find $E$, we restrict input formulas of our transformations
to those of the form $X(f(x, y)) \lor Y(g(x, y))$, $X(f(x, y)) \land Y(g(x,
y))$, and $\exists y.X(f(x, y))$, where $X$ and $Y$ are predicates defined by
fixpoint operators, and $f(x, y)$ and $g(x, y)$ denote (possibly sequences of)
terms that may contain free variables $x$ and $y$. For the sake of simplicity,
we assume here that the definitions for $X$ and $Y$ are independent; $X$ cannot
be obtained by unfolding $Y$, and vice versa. Transformations for more complex
formulas like the one in Example 5 can be achieved by repeatedly applying the
transformations for smaller contexts.

The transformation algorithm for disjunctive formulas is shown in
Algorithm 1. It takes as input a formula $\Phi = X(f(x, y)) \lor Y(g(x, y))$
and outputs an underapproximation $\Phi'$ of $\Phi$. It can take $[\,](f(x, y))
\lor Y(g(x, y))$ or $X(f(x, y)) \lor [\,](g(x, y))$ as the context $C$ and apply
Theorem 2 if $X$ or $Y$ is defined by $\nu$, and Theorem 1 otherwise (line 1).
On line 2, the algorithm unfolds $X$ and $Y$⁶ and then normalizes the resulting
formula to a conjunctive normal form (CNF), where quantified formulas are
treated as atomic. It then applies the "fold" transformation to each conjunct
$\psi_i$. To this end, for each $\psi_i$ that contains $X(s_1) \lor Y(s_2)$, the
algorithm finds terms $t_1$ and $t_2$ such that $X(s_1) \lor Y(s_2) \equiv
X(f(t_1, t_2)) \lor Y(g(t_1, t_2))$; this is achieved by solving the
unification constraints $s_1 = f(x', y')$ and $s_2 = g(x', y')$ modulo
arithmetic theories, where $x'$ and $y'$ are treated as variables but $x$ and
$y$ are treated as constants. Finally, the algorithm replaces $X(s_1) \lor
Y(s_2)$ with $Z(t_1, t_2)$, where $Z(x, y)$ is a new predicate that corresponds
to $X(f(x, y)) \lor Y(g(x, y))$.


Algorithm 2: Fold/unfold for $\exists$

Input: Formula $\Phi$ of the form $\exists y.X(f(x, y))$, where $X$ is a predicate defined
       by $\mu$ or $\nu$.
Output: A formula $\Phi'$ such that $\Phi \sqsupseteq \Phi'$.
1  $\bigvee \psi_i \leftarrow \mathit{dnf}(\mathit{normalize}_\exists(\mathit{unfold}(\Phi)))$;
2  for each $\psi_i$ do
3      if $\psi_i$ has the form $(\exists z.X(s)) \land \psi_i'$, and $f(t_x, y) = [t_z/z]s$,
          where $FV(t_x) \subseteq \{x\}$, $FV(t_z) \subseteq \{x, y\}$ then
4          $\psi_i \leftarrow Z(t_x) \land \psi_i'$;
5  return $\mu Z(x).\bigvee \psi_i$;

We omit the transformation algorithm for conjunctive formulas since it is
similar to the case above, except that the new predicate $Z$ is bound by $\mu$
(note that condition (i) of Theorem 2 may not be satisfied), and that it
converts the unfolded formula to a disjunctive normal form (DNF), instead of
CNF.

The algorithm for existential formulas is shown in Algorithm 2. It unfolds $X$,
normalizes existential quantifiers, and obtains a DNF. In the normalization of
existential quantifiers, it moves existential quantifiers inwards (by using,
e.g., the law $\exists x.(\psi_1 \lor \psi_2) \equiv (\exists x.\psi_1) \lor
(\exists x.\psi_2)$) and eliminates them as much as possible (by using, e.g.,
the equality-based quantifier elimination). For each disjunct $\psi_i$ of the
form $(\exists z.X(s)) \land \psi_i'$, it finds $t_x$ and $t_z$ such that
$f(t_x, y) = [t_z/z]s$ (again, by performing unification modulo arithmetic
theories), and replaces the disjunct with $Z(t_x) \land \psi_i'$. Here,
$Z(t_x)$ corresponds to $\exists y.X(f(t_x, y))$, and $t_z$ serves as a witness
for $X(f(t_x, y)) \Rightarrow \exists z.X(s)$.


5.2 Implementation and Experiments 


We have implemented the transformation in a tool called MuFolder based on the
algorithms discussed above, on top of the AdtInd theorem prover [37], using its
routines for pattern-matching, normalization, and simplification. For the
implication checks, MuFolder uses the Z3 SMT solver [27]. MuFolder can be tested
at https://www.kb.is.s.u-tokyo.ac.jp/~koba/mu/.


⁶ If none of the $\psi_i$'s are changed in the loop on lines 3-5, we may backtrack and
unfold $X$ and $Y$ more than once.

Table 1. Experiments.


# |input formula & output formula &’ 
1 |Even(zr) V Odd(x + 1) vZ(x).2 — 0 V Z(x — 2) 
2 |Even(x) V Odd(x) vZ(x).(x #0 V Even(x — 1)) ^ Z(x — 1) 
3 |Even(z) V Odd(x + 1) vZ(x).c =OV Z(r —2) 
4 Mult(z + y, z,r) V 3s.Mult(z, z, s) vZ(z,y,z,r)z 20V Z(z,y,z— l,r — (x + y)) 
5 |Mult(z + y,z,r)V s1, s2. vZ(z,y,z,r)z 20V Z(x,y,z— l,r — (x + y)) 
Mult(ax, z, s1) ^ Mult(y, z, s2) ^r = sı + s2 
6 |Mult(2x + 3y, z, r) V 3s1, 82. vZ(z,y,z,r).g = 0V 
Mult(z,z,51) ^ Mult(y, z, $2) Ar = 281 + 3s2 z AOA Z(x,y,z — 1,r — (2x + 3y)) 
7 |Mult(z,y,r) V Mult(a, y, r) vZ(x,y,7T).y=OVyFOA Z(az,y—1,7r—2) 
8 |Mult(z, y,r) V MultAcc(z, y, a, r + a) vZ(r,y,a,r).y = OV 
iy ZOÜ^Z(zy-—l,r-ca,r-—z) 
9 3r.Mult(z,y,r) uZ(z,y)y - 0Vy F0A Z(a,y — 1) 
10| Plus(x + y, z, r) V ds.Plus(a, z, s) vZ(z,y,z,r)z = 0 V Z(z,y,z — 1l,r — 1) 
11) Plus(4a@ — 3y, z, r) V 381, $2. vZ(z,y.z,r)z =0V Z(z,y,z— 1l,r— 1) 
Plus(a, z, s1) A Plus(y, z, s2) ^ r = 481 — 382 
12|dr.Sum(z,r) LZ(r)z — 0v zr Z0A Z(a — 1) 


We have evaluated MuFolder on several benchmarks outlined in Table 1.
These benchmarks include formulas obtained from the relational and temporal
verification properties; some of which have been taken from the benchmark set
for Unno et al.'s induction-based CHC solver [33] and modified to include both
$\mu$ and $\nu$. We have confirmed that all the benchmark problems can be solved
in our approach within a few seconds. To our knowledge, except the formulas 7, 8
(for which the method of [33] can be used) and 10, 11 (for which Mu2CHC works),
Mu2CHC (without our transformation) or the existing CHC solvers cannot directly
prove the validity of the formulas. Note that formula 12 comes from Example 6.
The combination of the transformation with Mu2CHC enables fully automated
verification of Example 6.


6 Related Work 


As already mentioned, fold/unfold transformations were originally proposed
for logic programming [32], and later extended to CHC (a.k.a. constraint logic
programs) [1,17]. Those transformations were originally proposed to speed
up program execution, but recently, Mordvinov and Fedyukovich [26] and De
Angelis et al. [15] have shown that related transformations are also useful in
the context of verification based on CHC solving. Those transformations
correspond to the transformation for the $\nu$-only fragment of MuArith.⁷ Our
transformation can thus be considered an extension of fold/unfold-like
transformations to MuArith, which allows alternations of least/greatest
fixpoints. Sato [31] studied an extension of fold/unfold transformations for a
first-order logic, where negations and quantifiers are allowed in clause bodies;
thus, some mixtures of least/greatest fixpoints are allowed. The correctness of
his transformation is, however, based on a three-valued logic, hence different
from MuArith. The correctness of most of the transformations mentioned above is
guaranteed by some syntactic conditions, while our transformation is based on
semantic conditions.

⁷ This is because, although the semantics of each predicate is interpreted as the least
fixpoint, the predicates occur in negative positions in goal clauses.

Unno et al. [33] proposed a method for automatically solving CHC problems by
using induction. Their method is based on a tailor-made proof system; hence it
is difficult to integrate the method with other CHC or MuArith solvers (in
fact, that disadvantage motivated the above-mentioned work of De Angelis et
al. [15]). Their method slightly goes beyond the CHC satisfiability (or the
$\nu$-only fragment of MuArith) but cannot deal with complex combinations of
least/greatest fixpoints and quantifiers (like $\forall x, y, z,
r.(\mathit{Mult}(x + y, z, r) \Rightarrow \exists s, t.\mathit{Mult}(x, z, s)
\land \mathit{Mult}(y, z, t) \land r = s + t)$, discussed in Section 4).

As mentioned in Section 1, fixpoint logic-based approaches to program veri- 
fication (including CHC-based ones) have been drawing attention. Kobayashi et 
al. [22, 23,35] have shown that temporal property verification of (higher-order) 
programs can be reduced to the validity checking of (higher-order) fixpoint logic 
formulas. They proposed a concrete method for checking validity of first-order 
fixpoint formulas and implemented a validity checking tool Mu2CHC. As discussed 
already, our transformations can be used to improve the capability of Mu2CHC. 
Another thread of work on a fixpoint logic-based approach to system verifica- 
tion is that of Parameterized Boolean Equation Systems (PBES) [21]. Actually, 
MuArith may be considered an instance of PBES, where data are restricted to 
integers. Groote, Willemse, and others [10, 14, 21, 30, 36] studied applications of 
PBES to verification of infinite state systems, and devised various techniques for 
solving PBES. To our knowledge, however, they have not studied fold/unfold 
transformations for PBES. 


7 Conclusions 


We have formalized fold/unfold-like transformations for a fixpoint logic, and
shown that they are useful for verification of relational/temporal properties of
recursive programs. We have implemented the transformations, and shown their
effectiveness through experiments.


Abstract. As a particular case study of the formal verification of
state-of-the-art, real software, we discuss the specification and verification
of a corrected version of the implementation of a linked list as provided by
the Java Collection framework.


Keywords: Java standard library - deductive verification - KeY - Java 
Modeling Language - case study - bug 


1 Introduction 


Software libraries are the building blocks of millions of programs, and they run 
on the devices of billions of users every day. Therefore, their correctness is of the 
utmost importance. The importance and potential of formal software verification 
as a means of rigorously validating state-of-the-art, real software and improving 
it, is convincingly illustrated by its application to TimSort, the default sorting 
library in many widely used programming languages, including Java and Python, 
and platforms like Android (see [7,9]): a crashing implementation bug was found. 

The Java implementation of TimSort belongs to the Java Collection framework,
which provides implementations of basic data structures and is among the
most widely used libraries. Nonetheless, over the years, 877 bugs in the
Collections Framework have been reported in the official OpenJDK bug tracker.

Due to the intrinsic complexity of modern software, the possibility of inter- 
ventions by a human verifier is indispensable for proving correctness. This holds 
in particular for the Java Collection library, where programs are expected to be- 
have correctly for inputs of arbitrary size. As a particular case study, we discuss 
the formal verification of a corrected version of the implementation of a linked 
list as specified by the class LinkedList of the Java Collection framework in 
Java 8. Apart from the fact that the data structure of a linked list is one of the 
basic structures for storing and maintaining unbounded data, this is an inter- 
esting case study because it provides further evidence that formal verification of 
real software can lead to major improvements and correctness guarantees. 


© The Author(s) 2020
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 217-234, 2020.
https://doi.org/10.1007/978-3-030-45237-7_13

We follow the general workflow underlying the TimSort case as depicted in
Fig. 1. The workflow starts with a formalisation of the informal documentation
of the Java code in the Java Modeling Language [10,16]. This formalisation goes
hand in hand with the formal verification: failed verification attempts can
provide information about further refinements of the specs. A failed
verification attempt may also indicate an error in the code, and can as such be
used for the generation of test cases to detect the error at run-time.

[Fig. 1: Workflow, with the stages Specification, Verification, Test-case
Generation, and Revision.]

LinkedList is the only List implementation in the Collection Framework that
allows collections of unbounded size. During verification we found out that
the Java linked list implementation does not correctly take into account the
Java integer overflow semantics. It is exactly for large lists ($\geq 2^{31}$
items) that the implementation breaks. This basic observation gave rise to a
number of test cases which show that Java's LinkedList class breaks 22 methods
out of a total of 25 methods of the List!⁴

On the basis of these test cases we propose in Sect. 2 a code revision of the 
Java linked list implementation, and formally specify and verify its correctness 
in Sect. 3 with respect to the Java integer overflow semantics. Section 4 discusses 
the main challenges posed by this case study and related work. 

This case study has been carried out using the state-of-the-art KeY theorem
prover [3], because it formalizes the integer overflow semantics of Java and it
allows one to directly "load" Java programs. An archive of proof files and the
KeY version used in this study is available on-line in the Zenodo repository [2].


2 LinkedList in OpenJDK 


LinkedList was introduced in Java version 1.2 as part of Java's Collection 
Framework in 1998. The LinkedList class is part of the type hierarchy of this 
framework: LinkedList implements the List interface, and also supports all 
general Collection methods as well as the methods from the Queue and Deque 
interfaces. The List interface provides positional access to the elements of the 
list, where each element is indexed by Java's primitive int type. 

The structure of the LinkedList class is shown in Listing 1. This class has 
three attributes: a size field, which stores the number of elements in the list, 
and two fields that store a reference to the first and last node. Internally, 
it uses the private static nested Node class to represent the items in the list. 
A static nested private class behaves like a top-level class, except that it is not 
visible outside the enclosing class (LinkedList, in this case). Nodes are doubly 
linked; each node is connected to the preceding (field prev) and succeeding node 


⁴ We filed a bug report to Oracle's security team. Once the report is made public by
the Java maintainers, we will add the URL as metadata to our repository [2].

public class LinkedList<E>
    extends AbstractSequentialList<E>
    implements List<E>, Deque<E>, ... {
  transient int size = 0;
  transient Node<E> first;
  transient Node<E> last;
  private static class Node<E> {
    E item;
    Node<E> next;
    Node<E> prev;
    Node(Node<E> p, E i, Node<E> n) ...
  }

  public boolean add(E e) {
    linkLast(e);
    return true;
  }

  void linkLast(E e) {
    final Node<E> l = last;
    final Node<E> newNode =
        new Node<>(l, e, null);
    last = newNode;
    if (l == null) first = newNode;
    else l.next = newNode;
    size++;
    modCount++;
  }
}

Listing 1: The LinkedList class defines a doubly-linked list data structure.


public int indexOf(Object o) {
  int index = 0;
  if (o == null) {
    for (Node<E> x = first; x != null; x = x.next) {
      if (x.item == null)
        return index;
      index++;
    }
  } else {
    for (Node<E> x = first; x != null; x = x.next) {
      if (o.equals(x.item))
        return index;
      index++;
    }
  }
  return -1;
}

Listing 2: The indexOf method searches for an element from the first node on.


(field next). These fields contain null in case no preceding or succeeding node 
exists. The data itself is contained in the item field of a node. 


LinkedList contains 57 methods. Due to space limitations, we now focus on
three characteristic methods: see Listing 1 and Listing 2. Method add(E) calls
method linkLast(E), which creates a new Node object to store the new item
and adds the new node to the end of the list. Finally the new size is determined
by unconditionally incrementing the value of the size field, which has type
int. Method indexOf(Object) returns the position (of type int) of the first
occurrence of the specified element in the list, or $-1$ if it is not present.
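
The effect of an unconditional increment on an int field can be shown in isolation. The following is a minimal sketch, not the library code: the class and method names are hypothetical, and it only mimics the `size++` step of linkLast on a plain int to show Java's wrap-around behaviour at `Integer.MAX_VALUE`.

```java
// Isolated illustration of int wrap-around on an unconditional increment (a sketch).
final class SizeOverflowDemo {
    // Mimics the `size++` in linkLast, applied to a cached size held in an int.
    static int incrementCachedSize(int size) {
        size++; // wraps around silently when size == Integer.MAX_VALUE
        return size;
    }

    public static void main(String[] args) {
        int size = Integer.MAX_VALUE;                  // 2^31 - 1 elements
        System.out.println(incrementCachedSize(size)); // prints "-2147483648"
    }
}
```

No exception is raised at the point of overflow, so a list holding $2^{31}$ items would report a negative cached size.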


Each linked list consists of a sequence of nodes. Sequences are finite, indexing
of sequences starts at zero, and we write σ[i] to mean the ith element of some
sequence σ. A chain is a sequence σ of nodes of length n > 0 such that: the
prev reference of the first node σ[0] is null, the next reference of the last node
σ[n - 1] is null, the prev reference of node σ[i] is node σ[i - 1] for every index
0 < i < n, and the next reference of node σ[i] is node σ[i + 1] for every index
0 ≤ i < n - 1. The first and last references of a linked list are either both null
to represent the empty linked list, or there is some chain σ between the first
and last node, viz. σ[0] = first and σ[n - 1] = last. Figure 2 shows example
instances. Also see standard literature such as Knuth's [15, Section 2.2.5].
Fig. 2: Three example linked lists: empty, with a chain of one node, and with a
chain of two nodes. Items themselves are not shown.


We make a distinction between the actual size of a linked list and its cached 
size. In principle, the size of a linked list can be computed by walking through 
the chain from the first to the last node, following the next reference, and 
counting the number of nodes. For performance reasons, the Java implementation 
also maintains a cached size. The cached size is stored in the linked list instance. 

Two basic properties of doubly-linked lists are acyclicity and unique first and
last nodes. Acyclicity is the statement that for any indices 0 ≤ i < j < n the
nodes σ[i] and σ[j] are different. First and last nodes are unique: for any index
i such that σ[i] is a node, the next of σ[i] is null if and only if i = n - 1, and
prev of σ[i] is null if and only if i = 0. Each item is stored in a separate node,
and the same item may be stored in different nodes when duplicate items are
present in the list.


2.1 Integer overflow bug 


The size of a linked list is encoded by a signed 32-bit integer (Java's primitive
int type) that has a two's complement binary representation where the most
significant bit is a sign bit. The values of int are bounded and between -2^31
(Integer.MIN_VALUE) and 2^31 - 1 (Integer.MAX_VALUE), inclusive. Adding one
to the maximum value, 2^31 - 1, results in the minimum value, -2^31: the carry
of addition is stored in the sign bit, thereby changing the sign.

Since the linked list implementation maintains one node for each element, its
size is implicitly bounded by the number of node instances that can be created.
Until 2002, the JVM was limited to a 32-bit address space, imposing a limit of
4 gigabytes (GiB) of memory. In practice this is insufficient to create 2^31 node
instances. Since 2002, a 64-bit JVM is available allowing much larger amounts
of addressable memory. Depending on the available memory, in principle it is
now possible to create 2^31 or more node instances. In practice such lists can be
constructed today on systems with 64 gigabytes of memory, e.g., by repeatedly
adding elements. However, for such large lists, at least 20 methods break, caused
by signed integer overflow. For example, several methods crash with a run-time
exception or exhibit unexpected behavior!
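
The wrap-around described above can be observed directly on a single int, without building a large list. The following is a minimal sketch; the class name IntOverflowDemo and its two helper methods are hypothetical and only serve the illustration. It contrasts the unchecked increment (as in size++) with the standard library's Math.addExact, which throws instead of wrapping.

```java
// Minimal sketch: wrap-around of Java's int versus a checked increment.
// IntOverflowDemo and its helpers are hypothetical names used for illustration.
class IntOverflowDemo {
    /** Unchecked increment, as in size++: silently wraps around. */
    static int uncheckedIncrement(int value) {
        return value + 1;
    }

    /** Checked increment: throws ArithmeticException instead of wrapping. */
    static int checkedIncrement(int value) {
        return Math.addExact(value, 1);
    }

    public static void main(String[] args) {
        int wrapped = uncheckedIncrement(Integer.MAX_VALUE);
        // The maximum value plus one is the minimum value.
        System.out.println(wrapped == Integer.MIN_VALUE);
        try {
            checkedIncrement(Integer.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```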

Integer overflow bugs are a common attack vector for security vulnerabilities: 
even if the overflow bug may seem benign, its presence may serve as a small step 
in a larger attack. Integer overflow bugs can be exploited more easily on large
memory machines used for ‘big data’ applications. Already, real-world attacks
involve Java arrays with approximately 27/5 elements [11, Section 3.2]. 

The Collection interface allows for collections with over Integer.MAX_VALUE
elements. For example, its documentation (Javadoc) explicitly states the
behavior of the size() method: ‘Returns the number of elements in this
collection. If this collection contains more than Integer.MAX_VALUE elements,
returns Integer.MAX_VALUE’. The special case (‘more than ...’) for large
collections is necessary because size() returns a value of type int.
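
To make the quoted contract concrete, the following sketch shows one way a collection that tracks its element count in a long could satisfy it. The class SaturatingSizeDemo and the method saturatedSize are hypothetical and are not part of the Java Collection Framework; the sketch only illustrates the clamping that the Javadoc describes.

```java
// Minimal sketch of the size() contract for very large collections.
// SaturatingSizeDemo and saturatedSize are hypothetical names.
class SaturatingSizeDemo {
    /** Clamps a (non-negative) long element count to the range of int. */
    static int saturatedSize(long elementCount) {
        if (elementCount < 0) {
            throw new IllegalArgumentException("negative element count");
        }
        // Math.min on longs, then a cast that can no longer overflow.
        return (int) Math.min(elementCount, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(saturatedSize(3L));
        // More than Integer.MAX_VALUE elements: the result saturates.
        System.out.println(saturatedSize(1L << 31) == Integer.MAX_VALUE);
    }
}
```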

When add(E) is called and unconditionally increments the size field, an
overflow happens after adding 2^31 elements, resulting in a negative size value.
In fact, as the Javadoc of the List interface describes, this interface is based on
integer indices of elements: ‘The user can access elements by their integer index
(position in the list), ...’. For elements beyond Integer.MAX_VALUE, it is very
unclear what integer index should be used. Since there are only 2^32 different
integer values, at most 2^32 node instances can be associated with a unique
index. For larger lists, elements cannot be uniquely addressed anymore using
an integer index. In essence, as we shall see in more detail below, the bounded
nature of the 32-bit integer indices implies that the design of the List interface
breaks down for large lists on 64-bit architectures. The above observations have
many ramifications: it can be shown that 22 of 25 methods in the List interface
are broken. Remarkably, the actual size of the linked list remains correct as the
chain is still in place: most methods of the Queue interface still work.


2.2 Reproduction 


We have run a number of test cases to show the presence of bugs caused by the
integer overflow. The running Java version was Oracle's JDK8 (build
1.8.0_201-b09) that has the same LinkedList implementation as in OpenJDK8.
Before running a test case, we set up an empty linked list instance. Below, we
give a high-level overview of the test cases. Each test case uses
letSizeOverflow() or addElementsUntilSizeIs0(): these repeatedly call the
method add() to fill the linked list with null elements, and the latter method
also adds a last element ("This is the last element") causing size to be 0 again.


1. Directly after size overflows, the size() method returns a negative value,
violating what the corresponding Javadoc stipulates: its value should remain
Integer.MAX_VALUE = 2^31 - 1.


letSizeOverflow(); 
System.out.println("linkedList.size() = " + linkedList.size() + ", actual: " + count); 
// linkedList.size() = -2147483648, actual: 2147483648 


Clearly this behavior is in contradiction with the documentation. The actual
number of elements is determined by having a field count (of type long) that
is incremented each time the method add() is called.

2. The query method get(int) returns the element at the specified position in
the list. It throws an IndexOutOfBoundsException when size is
negative. From the informal specification, it is unclear what indices should
be associated with elements beyond Integer.MAX_VALUE.


letSizeOverflow();
System.out.println(linkedList.get(0));
// Exception in thread "main" IndexOutOfBoundsException: Index: 0, Size: -2147483648
//   at java.util.LinkedList.checkElementIndex(LinkedList.java:555) ...

3. The method toArray() returns an array containing all of the elements in
this list in proper sequence (from first to last element). When size is
negative, this method throws a NegativeArraySizeException. Furthermore,
since the array size is bounded by 2^31 - 1 elements⁵, the contract
of toArray() is unsatisfiable for lists larger than this. The method
Collections.sort(List<T>) sorts the specified list into ascending order,
according to the natural ordering of its elements. This method calls
toArray(), and therefore also throws a NegativeArraySizeException.


letSizeOverflow();
Collections.sort(linkedList);
// Exception in thread "main" NegativeArraySizeException
//   at java.util.LinkedList.toArray(LinkedList.java:1050) ...


4. Method indexOf(Object o) returns the index of the first occurrence of the
specified element in this list, or -1 if this list does not contain the element.
However, due to the overflow, it is possible to have an element in the list
associated to index -1, which breaks the contract of this method.


addElementsUntilSizeIs0();
String last;
System.out.println("linkedList.getLast() = " + (last = linkedList.getLast()));
// linkedList.getLast() = This is the last element
System.out.println("linkedList.indexOf(" + last + ") = " + linkedList.indexOf(last));
// linkedList.indexOf(This is the last element) = -1


5. Method contains(Object o) returns true if this list contains the specified
element. If an element is associated with index -1, it will indicate wrongly
that this particular element is not present in the list.


addElementsUntilSizeIs0();
String last;
System.out.println("linkedList.getLast() = " + (last = linkedList.getLast()));
// linkedList.getLast() = This is the last element
System.out.println("linkedList.contains(" + last + ") = " + linkedList.contains(last));
// linkedList.contains(This is the last element) = false


Specifically, method letSizeOverflow() adds 2^31 elements, which causes the
overflow of size. Method addElementsUntilSizeIs0() first adds 2^32 - 1
elements: the value of size is then -1. Then, it adds the last element, and size
is 0 again. All elements added are null, except for the last element. For test
cases 4 and 5, we deliberately misuse the overflow bug to associate an element
with index -1. This means that method indexOf(Object) for this element
returns -1, which according to the documentation means that the element is not
present. For test cases 1, 2 and 3 we needed 65 gigabytes of memory for the JRE
on a VM with 67 gigabytes of memory. For test cases 4 and 5 we needed 167
gigabytes of memory for the JRE on a VM with 172 gigabytes of memory. All
test cases were carried out on a machine in a private cloud (SURFsara), which
provides instances that satisfy these system requirements.
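
Since the test cases above need tens of gigabytes of memory, a scaled-down model can help to see the same effects on an ordinary machine. The sketch below is not the authors' test harness: TinyLinkedList is a hypothetical class that mimics the shape of linkLast and indexOf but deliberately uses an 8-bit byte for the cached size and the index, so that the overflow occurs after 2^7 elements instead of 2^31, and the counter returns to 0 after 2^8 elements instead of 2^32.

```java
// Scaled-down model of the overflow (assumption: byte replaces int so that
// the wrap-around happens at 2^7 rather than 2^31). Not the JDK class.
class TinyLinkedList {
    private static final class Node {
        final Object item;
        Node next;
        Node prev;
        Node(Node prev, Object item) { this.prev = prev; this.item = item; }
    }

    private Node first;
    private Node last;
    private byte size = 0; // cached size, 8 bits wide in this model

    void add(Object item) {
        Node l = last;
        Node newNode = new Node(l, item);
        last = newNode;
        if (l == null) first = newNode; else l.next = newNode;
        size++; // unconditional increment, as in linkLast
    }

    int size() { return size; }

    /** Actual size: walk the chain and count the nodes with a wide counter. */
    long actualSize() {
        long n = 0;
        for (Node x = first; x != null; x = x.next) n++;
        return n;
    }

    /** Same loop shape as Listing 2, but with a byte index. */
    int indexOf(Object o) {
        byte index = 0;
        for (Node x = first; x != null; x = x.next) {
            if (o == null ? x.item == null : o.equals(x.item)) return index;
            index++;
        }
        return -1;
    }

    boolean contains(Object o) { return indexOf(o) != -1; }

    public static void main(String[] args) {
        TinyLinkedList overflowed = new TinyLinkedList();
        for (int i = 0; i < 128; i++) overflowed.add(null); // 2^7 elements
        System.out.println("cached: " + overflowed.size()
                + ", actual: " + overflowed.actualSize());

        TinyLinkedList wrapped = new TinyLinkedList();
        for (int i = 0; i < 255; i++) wrapped.add(null);    // 2^8 - 1 elements
        wrapped.add("This is the last element");            // cached size is 0 again
        System.out.println("indexOf: " + wrapped.indexOf("This is the last element"));
        System.out.println("contains: " + wrapped.contains("This is the last element"));
    }
}
```

In this model the last element sits at position 255, which as a signed byte is -1, so indexOf reports "not present" although walking the chain still finds all 256 nodes.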


5 In practice, the maximum array length turns out to be 2^31 - 5, as some bytes are
reserved for object headers, but this may vary between Java versions [11,14].
2.3 Mitigation


There are multiple directions for mitigating the overflow bug: do not fix, fail
fast, a long size field, and long or BigInteger indices. Due to lack of space, we
describe only the fail fast solution. This solution stays reasonably close to the
original implementation of LinkedList and does not leave any behavior unspecified.

In the fail fast solution, we ensure that the overflow of size may never occur.
Whenever elements would be added that cause the size field to overflow, the
operation throws an exception and leaves the list unchanged. As the exception
is triggered right before the overflow would otherwise occur, the value of size is
guaranteed to be bounded by Integer.MAX_VALUE, i.e. it never becomes negative.

This solution requires a slight adaptation of the implementation: for methods
that increase the size field, only one additional check has to be performed
before a LinkedList instance is modified. This checks whether the result of
the method causes an overflow of the size field. Under this condition, an
IllegalStateException is thrown. Thus, only in states where size is less than
Integer.MAX_VALUE, it is acceptable to add a single element to the list.

We shall work in a separate class called BoundedLinkedList: this is the
improved version that does not allow more than 2^31 - 1 elements. Compared to the
original LinkedList, two methods are added, isMaxSize() and checkSize():

private boolean isMaxSize() {
  return size == Integer.MAX_VALUE;
}

private void checkSize() {
  if (isMaxSize())
    throw new IllegalStateException("Not enough space");
}


These methods implement an overflow check. The latter method is called
before any modification occurs that increases the size by one: this ensures that
size never overflows. Some methods now differ when compared to the original
LinkedList, as they involve an invocation of the checkSize() method.
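
The following sketch illustrates where such a check could sit in add and why the list is left unchanged when it fails. It is not the BoundedLinkedList from the repository: CappedList is a hypothetical class, and its maxSize constructor parameter is an assumption introduced only so that the fail fast behavior can be observed with a handful of elements (in the setting described above the bound is Integer.MAX_VALUE).

```java
// Minimal sketch of the fail fast idea. CappedList and maxSize are
// hypothetical; the bound in the text above is Integer.MAX_VALUE.
class CappedList<E> {
    private static final class Node<E> {
        final E item;
        Node<E> next;
        Node<E> prev;
        Node(Node<E> prev, E item) { this.prev = prev; this.item = item; }
    }

    private final int maxSize;
    private int size = 0;
    private Node<E> first;
    private Node<E> last;

    CappedList(int maxSize) {
        if (maxSize < 0) throw new IllegalArgumentException("negative bound");
        this.maxSize = maxSize;
    }

    private boolean isMaxSize() {
        return size == maxSize;
    }

    private void checkSize() {
        if (isMaxSize())
            throw new IllegalStateException("Not enough space");
    }

    boolean add(E e) {
        checkSize(); // fail fast: nothing has been modified yet
        Node<E> l = last;
        Node<E> newNode = new Node<>(l, e);
        last = newNode;
        if (l == null) first = newNode; else l.next = newNode;
        size++;      // cannot overflow: size < maxSize <= Integer.MAX_VALUE
        return true;
    }

    int size() { return size; }

    E getLast() { return last == null ? null : last.item; }

    public static void main(String[] args) {
        CappedList<String> list = new CappedList<>(2);
        list.add("a");
        list.add("b");
        try {
            list.add("c");
        } catch (IllegalStateException ex) {
            // The list still holds exactly the two earlier elements.
            System.out.println(list.size() + " " + list.getLast());
        }
    }
}
```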


3 Specification and verification of BoundedLinkedList 


The aim of our specification and verification effort is to verify formalizations of 
the given Javadoc specifications (stated in natural language) of the LinkedList. 
This includes establishing absence of overflow errors. Moreover, we restrict our 
attention only to the revised BoundedLinkedList and not to the rest of the 
Collection Framework or Java classes: methods that involve parameters with 
interface types, Java serialization or Java reflection are considered out of scope. 

(Bounded)LinkedList inherits from AbstractSequentialList, but we consider
its inherited methods out of scope. These include methods that operate on other
collections, such as removeAll or containsAll, and methods that have other
classes as return type, such as iterator. However, these methods call methods
overridden by (Bounded)LinkedList, and cannot cause an overflow by themselves.

We have made use of KeY's stub generator to generate dummy contracts
for other classes that BoundedLinkedList depends on, such as for the inherited
interfaces and abstract super classes: these contracts conservatively specify that
every method may arbitrarily change the heap. The stub generator moreover
deals with generics by erasing the generic type parameters. For exceptions we
modify their stub contract to assume that their constructors are pure, viz.
leaving existing objects on the heap unchanged. An important stub contract is the
equality method of the absolute super class Object, which we have adapted: we
assume every object has a side-effect free, terminating and deterministic
implementation of its equality method⁶:

public class Object {
  /*@ public normal_behavior
    @ requires true;
    @ ensures \result == self.equals(param0);
    @*/
  public /*@ helper strictly_pure @*/ boolean
    equals(/*@ nullable */ Object param0);
}

3.1 Specification 


Following our workflow, we have iterated a number of times before the
specifications we present here were obtained. This is a costly procedure, as
revising some specifications requires redoing most verification effort. Until
sufficient information is present in the specification, proving for example
termination of a method is difficult or even impossible: from stuck verification
attempts, and an intuitive idea of why a proof is stuck, the specification is revised.

Ghost fields. We use JML’s ghost fields: these are logical fields that, for each
object, get a value assigned in a heap. The values of these fields are conceptual,
i.e. only used for specification and verification purposes. During run-time, these
fields are not present and cannot affect the course of execution. Our improved
class is annotated with two ghost fields: nodeList and nodeIndex.

The type of the nodeList ghost field is an abstract data type of sequences, 
a KeY built-in. This type has standard constructors and operations that can be 
used in contracts and in JML set annotations. A sequence has a length, which 
is finite but unbounded. The type of a sequence’s length is \bigint. In KeY a 
sequence is unityped: all its elements are of the any sort, which can be any Java 
object reference or primitive, or built-in abstract data type. One needs to apply 
appropriate casts and track type information for a sequence of elements in order 
to cast elements of the any sort to any of its subsorts. 

The nodeIndex ghost field is used as a ghost parameter with unbounded 
but finite integers as type. This ghost parameter is only used for specifying 
the behavior of the methods unlink(Node) and linkBefore(Object, Node). 
The ghost parameter tracks at which index the Node argument is present in the 
nodeList. This information is implicit and not needed at run-time. 


6 In reality, there are Java classes for which equality is not terminating. A nice example
is LinkedList itself, where adding a list to itself leads to a StackOverflowError when 
testing equality with a similar instance. We consider the issue out of scope of this 
study as this behavior is explicitly described by the Javadoc. 
Class invariant. The ghost field nodeList is used in the class invariant of
our improved implementation, see below. We relate the fields first and last 
that hold a reference to a Node instance, and the chain between first and last, 
to the contents of the sequence in the ghost field nodeList. This allows us to 
express properties in terms of nodeList, where they reflect properties about 
the chain on the heap. One may compare this invariant with the description of 
chains as given in Sect. 2. 


 1  //@ private ghost \seq nodeList;
 2  //@ private ghost \bigint nodeIndex;
 3  /*@ invariant
 4    @  nodeList.length == size &&
 5    @  nodeList.length <= Integer.MAX_VALUE &&
 6    @  (\forall \bigint i; 0 <= i < nodeList.length;
 7    @      nodeList[i] instanceof Node) &&
 8    @  ((nodeList == \seq_empty && first == null && last == null)
 9    @   || (nodeList != \seq_empty && first != null &&
10    @       first.prev == null && last != null &&
11    @       last.next == null && first == (Node)nodeList[0] &&
12    @       last == (Node)nodeList[nodeList.length-1])) &&
13    @  (\forall \bigint i; 0 < i < nodeList.length;
14    @      ((Node)nodeList[i]).prev == (Node)nodeList[i-1]) &&
15    @  (\forall \bigint i; 0 <= i < nodeList.length-1;
16    @      ((Node)nodeList[i]).next == (Node)nodeList[i+1]);
17    @*/


The actual size of a linked list is the length of the ghost field nodeList,
whereas the cached size is stored in a 32-bit signed integer field size. On line 4,
the invariant expresses that these two must be equal. Since the length of a
sequence (and thus nodeList) is never negative, this implies that the size field
never overflows. On line 5, this is made explicit: the real size of a linked list is
bounded by Integer.MAX_VALUE. Line 5 is redundant as it follows from line 4,
since a 32-bit integer never has a value larger than this maximum value. The
condition on lines 6-7 requires that every node in nodeList is an instance of
Node, which implies it is non-null.

A linked list is either empty or non-empty. On line 8, if the linked list is
empty, it is specified that first and last must be null references. On lines
9-12, if the linked list is non-empty, it is specified that first and last are
non-null and moreover that the prev field of the first Node and the next field of
the last Node are null. The nodeList must have as first element the node pointed
to by first, and last as last element. In any case, but vacuously true if the
linked list is empty, the nodeList forms a chain of nodes: lines 13-16 describe
that, for every node at index 0 < i < size, the prev field must point to its
predecessor, and similarly for successor nodes.
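
As an informal aid for reading the invariant, the conditions on lines 4-16 can also be written as an ordinary Java predicate over a snapshot of a list. The sketch below is only an approximation under stated assumptions: InvariantCheckDemo, invariantHolds and chain are hypothetical names, a java.util.List stands in for the ghost sequence nodeList, and a run-time check on individual snapshots is of course much weaker than the proof in KeY that every method preserves the invariant.

```java
import java.util.ArrayList;
import java.util.List;

// Run-time approximation of the class invariant (hypothetical helper, not KeY).
class InvariantCheckDemo {
    static final class Node {
        Object item;
        Node next;
        Node prev;
    }

    /** Returns true iff the conditions of lines 4-16 hold for this snapshot. */
    static boolean invariantHolds(List<Node> nodeList, Node first, Node last, int size) {
        // Lines 4-5: cached size equals the length (a List length is an int already).
        if (nodeList.size() != size) return false;
        // Lines 6-7: every entry is a non-null Node.
        for (Node n : nodeList) if (n == null) return false;
        // Lines 8-12: empty list, or first/last match the ends of the chain.
        if (nodeList.isEmpty()) {
            if (first != null || last != null) return false;
        } else {
            if (first == null || last == null) return false;
            if (first.prev != null || last.next != null) return false;
            if (first != nodeList.get(0)) return false;
            if (last != nodeList.get(nodeList.size() - 1)) return false;
        }
        // Lines 13-14: prev points to the predecessor.
        for (int i = 1; i < nodeList.size(); i++)
            if (nodeList.get(i).prev != nodeList.get(i - 1)) return false;
        // Lines 15-16: next points to the successor.
        for (int i = 0; i < nodeList.size() - 1; i++)
            if (nodeList.get(i).next != nodeList.get(i + 1)) return false;
        return true;
    }

    /** Builds a well-formed chain of n nodes. */
    static List<Node> chain(int n) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Node node = new Node();
            if (i > 0) {
                node.prev = nodes.get(i - 1);
                nodes.get(i - 1).next = node;
            }
            nodes.add(node);
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<Node> c = chain(3);
        System.out.println(invariantHolds(c, c.get(0), c.get(2), 3));  // well-formed
        System.out.println(invariantHolds(c, c.get(0), c.get(2), -3)); // wrong cached size
        c.get(2).next = c.get(0);                                      // introduce a cycle
        System.out.println(invariantHolds(c, c.get(0), c.get(2), 3));
    }
}
```

A negative cached size, as produced by the overflow, is rejected by the first check, and the cycle is rejected because the last node no longer has a null next reference.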

We note three interesting properties that are implied by the above invariant: 
acyclicity, unique first and unique last node. These properties can be expressed 
as JML formulas as follows: 

(\forall \bigint i; 0 <= i < nodeList.length - 1;
  (\forall \bigint j; i < j < nodeList.length;
    nodeList[i] != nodeList[j])) &&
(\forall \bigint i; 0 <= i < nodeList.length;
  nodeList[i].next == null <==> i == nodeList.length - 1) &&
(\forall \bigint i; 0 <= i < nodeList.length;
  nodeList[i].prev == null <==> i == 0)

These properties are not literally part of our invariant, but their validity is proven
interactively in KeY as a consequence of the invariant. Otherwise, we would also
need to reestablish these properties each time we show the invariant holds.

Methods. All methods within scope are given a JML contract that specifies their
normal behavior and their exceptional behavior. As an example contract, consider
the lastIndexOf(Object) method in Listing 3: it searches through the chain of
nodes until it finds a node with an item equal to the argument. This method is
interesting due to a potential overflow of the resulting index. BoundedLinkedList
together with all method specifications is available on-line [2].


3.2 Verification 


We start by giving a general strategy we apply to verify proof obligations.
We also describe in more detail how to produce a single proof, in this case
for lastIndexOf(Object). This gives a general feel for how proving in KeY works.
This method is neither trivial, nor very complicated to verify. In this manner,
we have produced proofs for each method contract that we have specified.
Overview of verification steps. When verifying a method, we first instruct 
KeY to perform symbolic execution. Symbolic execution is implemented by a 
number of proof rules that transform modal operators on program fragments in 
JavaDL. During symbolic execution, the goal sequent is automatically simplified, 
potentially leading to branches. Since our class invariant contains a disjunction 
(either the list is empty or not), we do not want these cases to be split early 
in the symbolic execution. Thus we instruct KeY to delay unfolding the class 
invariant. When symbolic execution is finished, goals may still contain updated 
heap expressions that must be simplified further. After this has been done, one 
can compare the open goals to the method body and its annotations, and see 
whether the open goals in KeY look familiar and check whether they are true. 
In the remaining part of the proof the user must find an appropriate mix 
between interactive and automatic steps. If a sequent is provable, there may be 
multiple ways to construct a closed proof tree. At (almost) every step the user 
has a choice between applying steps manually or automatically. It requires some 
experience in choosing which rules to apply manually: clever rule application 
decreases the size of the proof tree. Certain rules are never applied automatically, 
such as the cut rule. The cut rule splits a proof tree into two parts by introducing 
a detour, but significantly reduces the size of a proof and thus the effort required 
to produce it. For example, the acyclicity property can be introduced using cut.
Verification example. The method lastIndexOf has two contracts: one
involves a null argument, and another involves a non-null argument. Both proofs
are similar. Moreover, the proof for indexOf(...) is similar but involves the
next reference instead of the prev reference. This contract is interesting, since
proving its correctness shows the absence of the overflow of the index variable.


/*@ ...
  @ also
  @ public normal_behavior
  @ requires
  @   o != null;
  @ ensures
  @   \result >= -1 && \result < nodeList.length;
  @ ensures
  @   \result == -1 ==>
  @     (\forall \bigint i; 0 <= i < nodeList.length;
  @       !o.equals(((Node)nodeList[i]).item));
  @ ensures
  @   \result >= 0 ==>
  @     (\forall \bigint i; \result < i < nodeList.length;
  @       !o.equals(((Node)nodeList[i]).item)) &&
  @     o.equals(((Node)nodeList[\result]).item);
  @*/
public /*@ strictly_pure @*/ int
lastIndexOf(/*@ nullable @*/ Object o) {
  int index = size;
  if (o == null) {
    ...
  } else {
    /*@
      @ maintaining
      @   (\forall \bigint i; index <= i < nodeList.length;
      @     !o.equals(((Node)nodeList[i]).item));
      @ maintaining
      @   0 <= index && index <= nodeList.length;
      @ maintaining
      @   0 < index && index <= nodeList.length ==>
      @     x == (Node)nodeList[index - 1];
      @ maintaining
      @   index == 0 <==> x == null;
      @ decreasing
      @   index;
      @ assignable
      @   \strictly_nothing;
      @*/
    for (Node x = last; x != null; x = x.prev) {
      index--;
      if (o.equals(x.item))
        return index;
    }
  }
  return -1;
}
Listing 3: Method lastIndexOf(Object) annotated with JML. Searches the list
from last to first for an element. Returns -1 if this element is not present in the
list; otherwise returns the index of the node that was equal to the argument.
Only the contract and branch in which the argument is non-null is shown due
to space restrictions. Methods such as indexOf, removeFirstOccurrence and
removeLastOccurrence are very similar.


Proposition. lastIndexOf(Object) as specified in Listing 3 is correct.


Proof. Set strategy to default strategy, and set max. rules to 5,000, class axiom
delayed. Finish symbolic execution on the main goal. Set strategy to 1,000 rules
and select DefOps arithmetical rules. Close all provable goals under the root.
One goal remains. Perform update simplification macro on the whole sequent,
perform propositional with split macro on the sequent, and close provable goals
on the root of the proof tree. There is a remaining case:


- Case index - 1 = 0 ⟺ x.prev = null: split the equivalence. First case,
  suppose index - 1 = 0, then x = self.nodeList[0] = self.first and
  self.first.prev = null: solvable through unfolding the invariant and
  equational rewriting. Now, second case, suppose x.prev = null. Then, either
  index = 1 or index > 1 (from splitting index ≥ 1). The first of which
  is trivial (close provable goal), and the second one requires instantiating
  quantified statements from the invariant, leading to a contradiction: we
  have supposed x.prev = null, but x = self.nodeList[index - 1] and
  self.nodeList[index - 1].prev = self.nodeList[index - 2] and
  self.nodeList[index - 2] ≠ null.
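
To get a feel for what the loop invariant of Listing 3 says, its four maintaining clauses can be evaluated at run time on small lists. The sketch below makes several assumptions: LastIndexOfDemo is a hypothetical class, an ordinary java.util.List stands in for the ghost field nodeList, only the non-null branch is covered, and the for loop is rewritten as a while loop so that the invariant can be checked before every evaluation of the loop guard. Such checks only test the invariant on the executions that are run; they do not replace the proof.

```java
import java.util.ArrayList;
import java.util.List;

// Run-time check of the loop invariant of lastIndexOf (non-null branch only).
// LastIndexOfDemo is hypothetical; nodeList models the ghost field.
class LastIndexOfDemo {
    static final class Node {
        final Object item;
        Node next;
        Node prev;
        Node(Object item) { this.item = item; }
    }

    private final List<Node> nodeList = new ArrayList<>();
    private Node first;
    private Node last;
    private int size;

    void add(Object item) {
        Node n = new Node(item);
        n.prev = last;
        if (last == null) first = n; else last.next = n;
        last = n;
        nodeList.add(n); // keep the model sequence in step with the chain
        size++;
    }

    int lastIndexOf(Object o) {
        if (o == null) throw new IllegalArgumentException("only the non-null branch is sketched");
        int index = size;
        Node x = last;
        checkLoopInvariant(o, index, x);
        while (x != null) {
            index--;
            if (o.equals(x.item)) return index;
            x = x.prev;
            checkLoopInvariant(o, index, x);
        }
        return -1;
    }

    private void checkLoopInvariant(Object o, int index, Node x) {
        // 0 <= index && index <= nodeList.length
        if (index < 0 || index > nodeList.size())
            throw new AssertionError("index out of range");
        // No element at positions index .. length-1 equals o.
        for (int i = index; i < nodeList.size(); i++)
            if (o.equals(nodeList.get(i).item))
                throw new AssertionError("missed an occurrence");
        // 0 < index ==> x == nodeList[index - 1]
        if (index > 0 && x != nodeList.get(index - 1))
            throw new AssertionError("x is not the node at index - 1");
        // index == 0 <==> x == null
        if ((index == 0) != (x == null))
            throw new AssertionError("index and x disagree about the start of the list");
    }

    public static void main(String[] args) {
        LastIndexOfDemo list = new LastIndexOfDemo();
        for (String s : new String[] {"a", "b", "a", "c"}) list.add(s);
        System.out.println(list.lastIndexOf("a"));
        System.out.println(list.lastIndexOf("z"));
    }
}
```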


Interesting verification conditions. The acyclicity property is used to close
verification conditions that arise as a result of potential aliasing of node
instances: it is used as a separation lemma. For example, after a method that
changed the next field of an existing node, we want to reestablish that all nodes
remain reachable from the first through next fields (i.e., “connectedness”):
one proves that the update of next only affects a single node, and does not
introduce a cycle. We prove this by using the fact that two node instances are
different if they have a different index in nodeList, which follows from
acyclicity. Below, we sketch an argument why the acyclicity property follows from
the invariant. We have a video in which we show how the argument in KeY goes,
see [1, 0:55-11:30].


Proposition. Acyclicity follows from the linked list invariant. 


Proof. By contradiction: suppose a linked list of size n > 1 is not acyclic. Then
there are two indices, 0 ≤ i < j < n, such that the nodes at index i and j are
equal. Then it must hold that for all j ≤ k < n, the node at k is equal to the
node at k - (j - i). This follows from induction. Base case: if k = j, then node j
and node j - (j - i) = i are equal by assumption. Induction step: suppose node
at k is equal to node at k - (j - i), then if k + 1 < n it also holds that node
k + 1 equals node k + 1 - (j - i): this follows from the fact that node k + 1 and
k + 1 - (j - i) are both the next of node k < n - 1 and node k - (j - i). Since
the latter are equal, the former must be equal too. Now, for all j ≤ k < n, node
k equals node k - (j - i) in particular holds when k = n - 1. However, by the
property that only the last node has a null value for next, and a non-last node
has a non-null value for its next field, we derive a contradiction: if nodes k and
k - (j - i) are equal then all their fields must also have equal values, but node
k has a null and node k - (j - i) has a non-null next field!
Summary of verification effort. The total effort of our case study was about
7 man months. The largest part of this effort is finding the right specification. 
KeY supports various ways to specify Java code: model fields/methods, pure 
methods, and ghost variables. For example, using pure methods, contracts are 
specified by expressing the content of the list before/after the method using 
the pure method get(i), which returns the item at index i. This led to rather 
complex proofs: essentially it led to reasoning in terms of relational properties 
on programs (i.e. get(i) before vs get(i) after the method under consideration).
After 2.5 man months of writing partial specifications and partial proofs in these 
different formalisms, we decided to go with ghost variables as this was the only 
formalism in which we succeeded to prove non-trivial methods. 

It then took approximately 4 man months of iterating in our workflow through
(failed) partial proof attempts and refining the specs until they were sufficiently
complete. In particular, changes to the class invariant were “costly”, as this
typically caused proofs of all the methods to break (one must prove that all
methods preserve the class invariant). The possibility to interact with the prover
was crucial to pinpoint the cause of a failed verification attempt, and we used
this feature of KeY extensively to find the right changes/additions to the
specifications.

After the introduction of the field nodeList, several methods could be proved 
very easily, with a very low number of interactive steps or even automatically. 
Methods unlink(Node) and linkBefore(Object, Node) could not be proven 
without knowing the position of the node argument. We introduced a new ghost 
field, nodeIndex, that acts like a ghost parameter. Luckily, this did not affect 
the class invariant, and existing proofs that did not make use of the new ghost 
field were unaffected. 

Once the specifications were (sufficiently) complete, we estimate that it took
only approximately 1 or 1.5 man weeks to prove all methods. This can be reduced
further if informal proof descriptions are given. Moreover, we have recorded a 
video of a 30 minute proof session where the method unlinkLast is proven 
correct with respect to its contract [1]. 

Proof statistics. The table below summarizes the main proof statistics for all
methods. The last two columns are not metrics of the proof, but they indicate
the total lines of code (LoC) and the total lines of specifications (LoSpec).


Rules   | Branches | Interactive steps | Quant.ins | Contract | LoopInv | LoC | LoSpec
375,839 | 2,477    | 9,609             | 2,322     | 79       | 12      | 440 | 756

We found the most difficult proofs were for the method contracts of: clear(),
linkBefore(Object, Node), unlink(Node), node(int) and remove(Object).
The number of interactive steps seems a rough measure for effort required. But
we note that it is not a reliable representation of the difficulty of a proof: an
experienced user can produce a proof with very few interactive steps, while an
inexperienced user may take many more steps. The proofs we have produced are
by no means minimal.
4 Discussion


In this section we discuss some of the main challenges of verifying the real-world 
Java implementation of a LinkedList, as opposed to the analysis of an idealized 
mathematical linked list. 

Extensive use of Java language constructs. The LinkedList class uses a wide
range of Java language features. This includes nested classes (both static and
non-static), inheritance, polymorphism, generics, exception handling, object
creation and foreach loops. To load and reason about the real-world LinkedList
source code requires an analysis tool with high coverage of the Java language,
including support for the aforementioned language features.

Support for intricate Java semantics. The Java List interface is position 
based, and associates with each item in the list an index of Java’s int type. The 
bugs described in Section 2.1 were triggered on large lists, in which integer over- 
flows occurred. Thus, while an idealized mathematical integer semantics is much 
simpler for reasoning, it could not be used to analyze the bugs we encountered! 
It is therefore critical that the analysis tool faithfully supports Java’s semantics, 
including Java’s integer (overflow) behavior. 
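
The gap between the two integer semantics can be made concrete with a small sketch. The class and method names below are hypothetical and are not taken from the OpenJDK source; the sketch only contrasts a counter incremented with Java's int arithmetic, the same step in unbounded arithmetic, and a guarded variant that refuses to wrap around.

```java
import java.math.BigInteger;

// Hypothetical miniature of a list's size counter; NOT the OpenJDK source.
final class OverflowSketch {
    /** One "add" step with Java int semantics: wraps silently at Integer.MAX_VALUE. */
    static int sizeAfterAdd(int size) {
        return size + 1;
    }

    /** The same step with idealized (unbounded) mathematical integers. */
    static BigInteger idealSizeAfterAdd(BigInteger size) {
        return size.add(BigInteger.ONE);
    }

    /** A guarded step (an illustrative assumption): fails loudly instead of wrapping. */
    static int checkedSizeAfterAdd(int size) {
        return Math.addExact(size, 1); // throws ArithmeticException on overflow
    }
}
```

Starting from Integer.MAX_VALUE, the first method yields a negative size, the second yields a positive number that no int can hold, and the third throws. An analysis based on the second model alone would therefore never observe a negative size.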

Collections have a huge state space. A Java collection is an object that con- 
tains other objects (of a reference type). Collections can typically grow to an 
arbitrary (but in practice, bounded) size. By their very nature, collections thus 
intrinsically have a large state. To make this more concrete: triggering the bugs 
in LinkedList requires at least 2^31 elements (and 64 GiB of memory), and each
element, since it is of a reference type, has at least 2^32 values. This poses serious
problems to fully automated analysis methods that explore the state space. 

Interface specifications. Several of the LinkedList methods contain an
interface type as parameter. For example, the addAll method takes two arguments,
the second of which is of the Collection type:

    public boolean addAll(int index, Collection c) {
        Object[] a = c.toArray();


As KeY follows the design by contract paradigm, verification of LinkedList’s 
addAll method requires a contract for each of the other methods called, includ- 
ing the toArray method in the Collection interface. How can we specify in- 
terface methods, such as Collection.toArray? The stub generator generates a 
conservative contract: it may arbitrarily modify the heap and return any array. 
Simple conditions on parameters or the return value are easily expressed, but 
meaningful contracts that relate the behavior of the method to the contents of
the collection require some notion of state to capture all mutations of the collec- 
tion, so that previous calls to methods in the interface that contributed to the 
current contents of the collection are taken into account. Model fields/methods 
[3, Section 9.2] are a widely used mechanism for abstract specification. A model 
field or method is represented in a concrete class in terms of the concrete state 
given by its fields. In this case, as only the interface type Collection is known 
rather than a concrete class, such a representation cannot be defined. Thus the
behavior of the interface cannot be fully captured by specifications in terms 
of model fields/variables, including for methods such as Collection.toArray. 
Ghost variables cannot be used either, since ghost variables are updated by 
adding set statements in method bodies, and interfaces do not contain method 
bodies. This raises the question: how to specify behavior of interface methods?


Verifiable code revisions. We fixed the LinkedList class by explicitly bounding
its maximum size to Integer.MAX_VALUE elements, but other solutions are
possible. Rather than using integer indices for elements, one could change to
an index of type long or BigInteger. Such a code revision is however incompatible
with the general Collection and List interfaces (whose method signatures
mandate the use of integer indices), thereby breaking all existing client code that
uses LinkedList. Clearly this is not an option in a widely used language like
Java, or any language that aims to be backwards compatible.


It raises the challenge: can we find code revisions that are compatible with
existing interfaces and client classes? We can take this challenge even further:
can we use our workflow to find code revisions that are both compatible and
amenable to formal verification? The existing code in general is not designed
for verification. For example, the LinkedList class exposes several implementation
details to classes in the java.util package: all fields, including size, are
package private (not private!), which means they can be assigned a new value
directly (without calling any methods) by other classes in that package. This
includes setting size to negative values. As we have seen, the class malfunctions
for negative size values. In short, this means that the LinkedList itself cannot
enforce its own invariants anymore: its correctness now depends on the correctness
of other classes in the package. The possibility to avoid calling methods
to access the list's fields may yield a small performance gain, but it precludes a
modular analysis: to assess the correctness of LinkedList one must now analyze
all classes in the same package (!) to determine whether they make benign
changes (if any) to the fields of the list. Hence, we recommend encapsulating
such implementation details, including making at least all fields private.
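
The encapsulation issue can be illustrated with a minimal sketch. The classes TinyList and SamePackageClient are hypothetical stand-ins (they are not part of java.util); the sketch assumes both live in the same package, which is what allows the direct field write.

```java
// Hypothetical miniature; names are illustrative and NOT taken from java.util.
final class TinyList {
    int size;                     // package-private, like the fields discussed above
    private int encapsulatedSize; // private: only TinyList's own methods can write it

    void add() {
        size++;
        encapsulatedSize++;
    }

    /** Invariant over the exposed field: can be broken from outside the class. */
    boolean invariantHolds() {
        return size >= 0;
    }

    /** Invariant over the private field: only TinyList's methods need inspecting. */
    boolean encapsulatedInvariantHolds() {
        return encapsulatedSize >= 0;
    }
}

// Any class in the same package may bypass TinyList's methods entirely.
final class SamePackageClient {
    static void corrupt(TinyList list) {
        list.size = -1; // compiles: package-private access, no method call involved
        // list.encapsulatedSize = -1; // would NOT compile: the field is private
    }
}
```

After a call to SamePackageClient.corrupt, the invariant on size no longer holds although no method of TinyList was executed, whereas the invariant on the private field can be established by looking at TinyList alone.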


Proof reuse. Section 3.2 discussed the proof effort (in person months). It 
revealed that while the total effort was 6-7 person months, once the specifications 
are in place after many iterations of the workflow, producing the actual final 
proofs took only 1-2 weeks! But minor specification changes often require redoing
nearly the whole proof, which causes much delay in finding the right specification.
Other program verification case studies [3,4,8,9] show similarly that the main 
bottleneck today is specification, not verification. This calls for techniques to 
optimize proof reuse when the specification is slightly modified, allowing for a 
more rapid development of specifications. 


Footnote: Since the representation of classes that implement the interface is unknown in the
interface itself, a particularly challenging aspect here is: how to specify the footprint 
of an interface method, i.e.: what part of the heap can be modified by the method 
in the implementing class? 
Status of the challenges. Most of these challenges are still open. The challenge
concerning “Interface specifications” could perhaps be addressed by defining an
abstract state of an interface by using or developing some form of trace
specification that maps a sequence of calls to the interface methods to a value,
together with a logic to reason about such trace specifications.

The challenges related to code revisions and proof reuse are compounded 
for analysis tools that use very fine-grained proof representations. For exam- 
ple, proofs in KeY consist of actual rule applications (rather than higher level 
macro/strategy applications), and proof rule applications explicitly refer to the 
indices of the (sub) formulas the rule is applied to. This results in a fragile proof 
format, where small changes to the specifications or source code (such as a code 
refactoring) break the proof. 

The KeY system covered the Java language features sufficiently to load and 
statically verify the LinkedList source code. KeY also supports various integer 
semantics, allowing us to analyze LinkedList with the actual Java integer over- 
flow semantics. As KeY is a theorem prover (based on deductive verification), 
it does not explore the state space of the class under consideration, thus solving 
the problem of the huge state space of Java collections. We could not find any 
other tools that solved these challenges, so we decided at that point to use KeY. 

However, other state-of-the-art systems such as Coq, Isabelle and PVS sup- 
port proof scripts. Those proofs are described at a typically much more coarse- 
grained level when compared to KeY. It would be interesting to see to what 
extent Java language features and semantics can be handled in (extensions of) 
such higher level proof script languages. 


4.1 Related work 


Knüppel et al. [14] provide a report on the specification and verification of some 
methods of the classes ArrayList, Arrays, and Math of the OpenJDK Collec- 
tions framework using KeY. Their report is mainly meant as a "stepping stone 
towards a case study for future research." To the best of our knowledge, no for- 
mal specification and verification of the actual Java implementation of a linked 
list has been investigated. In general, the data structure of a linked list has been 
studied mainly in terms of pseudo code of an idealized mathematical abstraction 
(see [18] for an Eiffel version and [12] for a Dafny version). 

This paper (and [14]) has shown that the specification and verification of ac- 
tual library software poses a number of serious challenges to formal verification. 
In our case study, we used KeY to verify Java’s linked list. Other formalizations 
of Java also exist, such as Bali [17] and Jinja [13] (using the general-purpose
theorem prover Isabelle/HOL), OpenJML [6] (a prover dedicated to Java pro- 
grams), and VerCors [5] (focusing on concurrent Java programs, translated into 
Viper/Z3). However, these formalizations do not have a complete enough Java 
semantics to be able to analyze the bugs presented in this paper. In particu- 
lar, these formalizations seem to have no built-in support for integer overflow 
arithmetic, although it can be added manually. 




Self-references


1. Bian, J., Hiep, H.A.: Verifying OpenJDK’s LinkedList using KeY: Video (2019).
   https://doi.org/10.6084/m9.figshare.10033094.v2

2. Hiep, H.A., Maathuis, O., Bian, J., de Boer, F.S., van Eekelen, M.,
   de Gouw, S.: Verifying OpenJDK’s LinkedList using KeY: Proof Files (2019).
   https://doi.org/10.5281/zenodo.3517081


References


3. Ahrendt, W., Beckert, B., Bubel, R., Hähnle, R., Schmitt, P.H., Ulbrich, M.
   (eds.): Deductive Software Verification: The KeY Book, LNCS, vol. 10001. Springer
   (2016). https://doi.org/10.1007/978-3-319-49812-6

4. Baumann, C., Beckert, B., Blasum, H., Bormer, T.: Lessons learned from
   microkernel verification -- specification is the new bottleneck. In: SSV 2012:
   Systems Software Verification. EPTCS, vol. 102, pp. 18-32. OPA (2012).
   https://doi.org/10.4204/EPTCS.102.4

5. Blom, S., Darabi, S., Huisman, M., Oortwijn, W.: The VerCors tool set: Verification
   of parallel and concurrent software. In: iFM 2017: Integrated Formal Methods.
   LNCS, vol. 10510, pp. 102-110. Springer (2017).
   https://doi.org/10.1007/978-3-319-66845-1_7

6. Cok, D.R.: OpenJML: Software verification for Java 7 using JML, OpenJDK,
   and Eclipse. In: F-IDE 2014: Workshop on Formal Integrated
   Development Environment. EPTCS, vol. 149, pp. 79-92. OPA (2014).
   https://doi.org/10.4204/EPTCS.149.8

7. de Gouw, S., de Boer, F.S., Bubel, R., Hähnle, R., Rot, J., Steinhöfel, D.: Verifying
   OpenJDK's sort method for generic collections. J. Autom. Reasoning 62(1), 93-126
   (2019). https://doi.org/10.1007/s10817-017-9426-4

8. de Gouw, S., de Boer, F.S., Rot, J.: Proof Pearl: The KeY to correct and stable
   sorting. J. Autom. Reasoning 53(2), 129-139 (2014).
   https://doi.org/10.1007/s10817-013-9300-y

9. de Gouw, S., Rot, J., de Boer, F.S., Bubel, R., Hähnle, R.: OpenJDK's
   java.utils.Collection.sort() is broken: The good, the bad and the worst case. In:
   CAV 2015: Computer Aided Verification. LNCS, vol. 9206, pp. 273-289. Springer
   (2015). https://doi.org/10.1007/978-3-319-21690-4_16

10. Huisman, M., Ahrendt, W., Bruns, D., Hentschel, M.: Formal specification
    with JML. Tech. rep. Karlsruher Institut für Technologie (KIT) (2014).
    https://doi.org/10.5445/IR/1000041881

11. Ieu Eauvidoum, disk noise: Twenty years of escaping the Java sandbox. Phrack
    Magazine (September 2018),
    http://www.phrack.org/papers/escaping_the_java_sandbox.html

12. Klebanov, V., Müller, P., et al.: The 1st verified software competition: Experience
    report. In: FM 2011: Formal Methods. LNCS, vol. 6664, pp. 154-168. Springer
    (2011). https://doi.org/10.1007/978-3-642-21437-0_14

13. Klein, G., Nipkow, T.: A machine-checked model for a Java-like language,
    virtual machine, and compiler. ACM TOPLAS 28(4), 619-695 (2006).
    https://doi.org/10.1145/1146809.1146811

14. Knüppel, A., Thüm, T., Pardylla, C., Schaefer, I.: Experience report on formally
    verifying parts of OpenJDK’s API with KeY. In: F-IDE 2018: Formal Integrated
    Development Environment. EPTCS, vol. 284, pp. 53-70. OPA (2018).
    https://doi.org/10.4204/EPTCS.284.5

15. Knuth, D.E.: The art of computer programming, vol. 1. Addison-Wesley, 3rd edn.
    (1997) ISBN: 978-0-201-89683-4

16. Leavens, G.T., Baker, A.L., Ruby, C.: JML: A notation for detailed design. In:
    Behavioral Specifications of Businesses and Systems, SECS, vol. 523, pp. 175-188.
    Springer (1999). https://doi.org/10.1007/978-1-4615-5229-1_12

17. Nipkow, T., von Oheimb, D.: Javalight is type-safe -- definitely. In: POPL
    1998: Principles of Programming Languages. pp. 161-170. ACM (1998).
    https://doi.org/10.1145/268946.268960

18. Polikarpova, N., Tschannen, J., Furia, C.A.: A fully verified container library.
    In: FM 2015: Formal Methods. LNCS, vol. 9109, pp. 414-434. Springer (2015).
    https://doi.org/10.1007/978-3-319-19249-9_26


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original authors and the source,
provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter's Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.




Analysing installation scenarios of Debian packages


Benedikt Becker¹, Nicolas Jeannerod², Claude Marché¹, Yann Régis-Gianas²,³,
Mihaela Sighireanu², and Ralf Treinen²


¹ Université Paris-Saclay, Univ. Paris-Sud, CNRS, Inria, LRI, 91405, Orsay, France
² Université de Paris, IRIF, CNRS, F-75013 Paris, France
³ Inria, F-75013 Paris, France


Abstract. The Debian distribution includes more than 28 thousand
maintainer scripts, almost all of which are written in POSIX shell. These
scripts are executed with root privileges at installation, update, and
removal of a package, which makes them critical for system maintenance.
While Debian policy provides guidance for package maintainers producing
the scripts, few tools exist to check the compliance of a script with it. We
report on the application of a formal verification approach based on
symbolic execution to find violations of some non-trivial properties required
by Debian policy in maintainer scripts. We present our methodology
and give an overview of our toolchain. We obtained promising results:
our toolchain is effective in analysing a large set of Debian maintainer
scripts and it pointed out over 150 policy violations that led to reports
(more than half already fixed) on the Debian Bug Tracking system.


Keywords: Quality Assurance · Safety Properties · Debian · Software
Package Installation · Shell Scripts · High-Level View of File Hierarchies

1 Introduction


The Debian distribution is one of the oldest free software distributions, pro- 
viding today 60 000 binary packages built from more than 31 000 software source 
packages with an official support for nine different CPU architectures. It is one 
of the most used GNU/Linux distributions, and serves as the basis for some 
derived distributions like Ubuntu. 

A software package of Debian contains an archive of files to be placed on 
the target machine when installing the package. The package may come with a 
number of so-called maintainer scripts which are executed when installing,
upgrading, or removing the package. A current version of the Debian distribution
contains 28 814 maintainer scripts in 12 592 different packages, 9 771 of which
are completely or partially written by hand. These scripts are used for tasks
like cleaning up, configuration, and repairing mistakes introduced in older ver- 
sions of the distribution. Since they may have to perform any action on the tar- 
get machine, the scripts are almost exclusively written in some general-purpose 
scripting language that allows for invoking any Unix command. 


The whole installation process is orchestrated by dpkg, a Debian-specific tool, 
which executes the maintainer scripts of each package according to scenarios. 
The dpkg tool and the scripts require root privileges. For this reason, the failure 
of one of these scripts may lead to effects ranging from mildly annoying (like 
spurious warnings) to catastrophic (removal of files belonging to unrelated pack- 
ages, as already reported [39]). When an execution error of a maintainer script is 
detected, the dpkg tool attempts an error unwind, but the success of this oper- 
ation depends again on the correct behaviour of maintainer scripts. There is no 
general mechanism to simply undo the unwanted effects of a failed installation 
attempt, short of using a file system implementation providing for snapshots. 


The Debian policy [4] aims to normalise, in natural language, important tech- 
nical aspects of packages. Concerning the maintainer scripts we are interested in, 
it states that the standard shell interpreter is POSIX shell, with the consequence 
that 99% of all maintainer scripts are written in this language. The policy also 
sets down the control flow of the different stages of the package installation pro- 
cess, including attempts of error recovery, defines how dpkg invokes maintainer 
scripts, and states some requirements on the execution behaviour of scripts. One 
of these requirements is the idempotency of scripts. To this day, most of these
properties are checked only at a very basic syntactic level (using tools like lintian [1]),
by automated testing (like the piuparts suite [2]), or simply left until someone
stumbles upon a bug and reports it to Debian.


The goal of our study is to improve the quality of the installation of 
software packages in the Debian distribution using a formal and automated ap- 
proach. We focus on bug finding for three reasons. Firstly, a real Unix-like oper- 
ating system is obviously too complex to be described completely and accurately 
by some formal model. Besides, the formal correctness properties may be difficult 
to apprehend by Debian maintainers especially when they are expressed on an 
abstract model. Finally, when a bug is detected, even on a system abstraction, 
one can try to reproduce it on a real system and, if confirmed, report it to the 
authors. This has a real and immediate impact on the quality of the software 
and helps to promote the usage of formal methods to a community that often is 
rather sceptical towards methods and tools coming from academic research. 


The bugs in Debian maintainer scripts that we attempt to find may come at 
different levels: simple syntax errors (which may go unnoticed due to the unsafe 
design of the POSIX shell language), non-compliance with the requirements of
the Debian policy, usage of unofficial or undocumented features, or failure of a 
script in a situation where it is supposed to succeed. 

The challenges are multiple: The POSIX shell language is highly dynamic
and recalcitrant to static analysis, both on a syntactic and semantic level. A
Unix file system implementation contains many features that are difficult to
model, e.g., ownership, permissions, timestamps, symbolic links, and multiple
hard links to regular files. There is an immense variety of Unix commands that 
may be invoked from scripts, all of which have to be modelled in order to be 
treated by our tools. To address properties of scripts required by the Debian 
policy, we need to capture the transformation done by the script on a file system 
hierarchy. For this, we need some kind of logic that is expressive enough, and 
still allows for automated reasoning methods. A particular challenge is checking 
the idempotency property for script execution because it requires relational rea- 
soning. For this, we encode the semantics of a script as a logic formula specifying 
the relation between the input and the output of the script, and we check that 
it is equivalent to its composition with itself. Finally, all these challenges have 
to be met at the scale of tens of thousands of scripts. 
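
The idempotency check described above compares a script's input/output relation with its composition with itself. The sketch below illustrates the idea on a toy model only: the file system is a map from paths to node kinds, the two "scripts" and their paths are hypothetical, and the comparison is run on a handful of concrete states. It is not the symbolic, formula-based check the paper performs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Toy model (an assumption for illustration): path -> kind ("file", "dir", "symlink").
final class IdempotencySketch {
    /** Builds an abstract file system from alternating path/kind arguments. */
    static Map<String, String> fs(String... pathKindPairs) {
        Map<String, String> m = new TreeMap<>();
        for (int i = 0; i + 1 < pathKindPairs.length; i += 2) {
            m.put(pathKindPairs[i], pathKindPairs[i + 1]);
        }
        return m;
    }

    /** Hypothetical script A: remove /etc/demo.conf if it is a symbolic link. */
    static Map<String, String> removeSymlink(Map<String, String> in) {
        Map<String, String> out = new TreeMap<>(in);
        if ("symlink".equals(out.get("/etc/demo.conf"))) {
            out.remove("/etc/demo.conf");
        }
        return out;
    }

    /** Hypothetical script B: create a marker file if absent, delete it if present. */
    static Map<String, String> toggleMarker(Map<String, String> in) {
        Map<String, String> out = new TreeMap<>(in);
        if (out.containsKey("/var/demo.marker")) {
            out.remove("/var/demo.marker");
        } else {
            out.put("/var/demo.marker", "file");
        }
        return out;
    }

    /** True if running the script twice equals running it once on every sample input. */
    static boolean idempotentOn(UnaryOperator<Map<String, String>> script,
                                List<Map<String, String>> inputs) {
        for (Map<String, String> in : inputs) {
            Map<String, String> once = script.apply(in);
            Map<String, String> twice = script.apply(once);
            if (!twice.equals(once)) {
                return false;
            }
        }
        return true;
    }
}
```

On such samples, script A passes and script B fails. Testing concrete states can only refute idempotency; establishing it for all inputs is what motivates encoding the relation as a logic formula.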
The contributions of this work for this case study are: 


1. A translation of Debian maintainer scripts into a language with formal se- 
mantics, and a formalisation of properties required for the execution of these 
scripts by the Debian policy. 

2. A verification toolchain for maintainer scripts based on an existing symbolic 
execution engine [5,6] and a symbolic representation [26]. Some components 
of this toolchain have been published independently; we improve them to 
cope with this case study. The toolchain is free software available online [35]. 

3. A formal specification of the transformations done by an important set of 
POSIX commands [24] in feature tree constraints [26].

4. A number of bugs found by our method in recent versions of Debian packages. 


We start in the next section with an overview of our method illustrated on 
a concrete example. Section 3 explains in greater detail the elements of our 
toolchain, the particular challenges, the hypotheses that we could make for the 
specific Debian use case at hand, and the solution that we have found. Section 4 
presents the results we have found so far on the Debian packages, and the lessons 
learnt. We conclude in Section 5 by discussing additional outcomes of this study, 
the related and future work. 


2 Overview of the case study and analysis methodology 


2.1 Debian packages 


Three components of a Debian binary package play an important role in the 
installation process: the static content, i.e., the archive of files to be placed on 
the target machine when installing the package; the lists of dependencies and pre- 
dependencies, which tell us which packages can be assumed present at different 
moments; and the maintainer scripts, i.e., a possibly empty subset of four scripts 
called preinst, postinst, prerm, and postrm. We found (Section 4.2) that 99% 
of the maintainer scripts in Debian are written in POSIX shell [22].

Our running example is the binary package rancid-cgi [31]. It comes with
only two maintainer scripts: preinst and postinst. The preinst script is
included in Fig. 1. If the symbolic link /etc/rancid/lg.conf exists then it is
removed; if the file /etc/rancid/apache.conf exists, no matter its type, it is
also removed. Both removal operations use the POSIX command rm which,
without options, cannot remove directories. Hence, if /etc/rancid/apache.conf is
a directory, this script fails while trying to remove it.


    if [ -h /etc/rancid/lg.conf ]; then
      rm /etc/rancid/lg.conf
    fi
    if [ -e /etc/rancid/apache.conf ]; then
      rm /etc/rancid/apache.conf
    fi


Fig. 1. preinst script of the rancid-cgi package
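
The failure mode of the preinst script in Fig. 1 can be replayed on a toy model. The sketch below is an illustration under stated assumptions, not part of the authors' toolchain: the file system is a map from paths to node kinds, rm without options is modelled as failing on directories and missing paths, and the script is assumed to stop with a non-zero status as soon as a command fails (as in -e mode).

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model (assumption): a flat map from paths to node kinds, ignoring contents.
final class PreinstSketch {
    enum Kind { FILE, DIR, SYMLINK }

    /** Models rm without options: fails on a missing path or a directory. */
    static boolean rm(Map<String, Kind> fs, String path) {
        Kind kind = fs.get(path);
        if (kind == null || kind == Kind.DIR) {
            return false;
        }
        fs.remove(path);
        return true;
    }

    /** Models the script of Fig. 1; returns 0 on success and 1 on failure. */
    static int preinst(Map<String, Kind> fs) {
        // [ -h /etc/rancid/lg.conf ]: the path exists and is a symbolic link
        if (fs.get("/etc/rancid/lg.conf") == Kind.SYMLINK) {
            if (!rm(fs, "/etc/rancid/lg.conf")) {
                return 1;
            }
        }
        // [ -e /etc/rancid/apache.conf ]: the path exists, whatever its kind
        if (fs.containsKey("/etc/rancid/apache.conf")) {
            if (!rm(fs, "/etc/rancid/apache.conf")) {
                return 1;
            }
        }
        return 0;
    }
}
```

With apache.conf present as a regular file the model returns 0 and removes it; with apache.conf present as a directory it returns 1 and leaves the directory in place, which is the error case discussed above.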

We did a statistical analysis of maintainer scripts in Debian to help us de- 
sign our intermediate language, see Section 4.2 for details. We found that, for 
instance, most variables in these scripts can be expanded statically and hence are 
used like constants; most while loops can be translated into for loops; recursive 
functions are not used at all; redirections are almost always used to discard the 
standard output of commands. 


2.2 Managing package installation 


The maintainer scripts are invoked by the dpkg utility when installing, removing
or upgrading packages. Roughly speaking, for installation dpkg calls the preinst 
before the package static content is unpacked, and calls the postinst afterwards. 
For deinstallation, it calls the prerm before the static content is removed and calls 
the postrm afterwards. The precise sequence of script invocations and the actual 
script parameters are defined by informal flowcharts in the Debian policy [4, 
Appendix 9]. Fig. 2 shows the flowchart for the package installation. dpkg may 
be asked to: install a package that was not previously installed (Fig. 2), install a 
package that was previously removed but not purged, upgrade a package, remove 
a package, purge a package previously removed, remove and purge a package. 
These tasks include 39 possible execution paths, 4 of them presented in Fig. 2.
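
As an illustration of what an execution path is, the four paths of the fresh-installation task can be written as a small decision procedure. This is a sketch based on our reading of Fig. 2; the class name and the representation of a script run as a predicate are assumptions, and the unpacking step is assumed to succeed.

```java
import java.util.function.Predicate;

// Sketch of the fresh-install flow as read from Fig. 2; NOT dpkg's implementation.
final class InstallFlowSketch {
    /**
     * runScript receives a script invocation such as "preinst install"
     * and returns true when that invocation succeeds.
     * The result is the package status reported at the end of the process.
     */
    static String install(Predicate<String> runScript) {
        if (!runScript.test("preinst install")) {
            // Error unwind: its outcome again depends on a maintainer script.
            return runScript.test("postrm abort-install")
                    ? "Not Installed"
                    : "Half Installed";
        }
        // The static content is unpacked here (assumed to succeed).
        return runScript.test("postinst configure")
                ? "Installed"
                : "Failed-Config";
    }
}
```

Each of the four possible return values corresponds to one path through the flowchart, so an analysis of this task has to consider every combination of success and failure of the scripts involved.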

The Debian policy contains [4, Chapters 6 and 10] several requirements on 
maintainer scripts. This case study targets checking the requirements regarding 
the execution of scripts, and considers out of scope some other kinds of re- 
quirements, e.g., the permissions of script files. The requirements of interest are 
checked by different tools of our toolchain presented in Section 3. For example, 
the different ways to invoke a maintainer script are handled by the analysis of 
scenarios (Section 3.5) calling the scripts. Different requirements on the usage 
of the shell language are checked by the syntactic analysis (Section 3.1), like 
the usage of -e mode or of authorised shell features that are optional in the 
POSIX standard. Some of the usage requirements can be detected by a semantic
analysis; this is done in our toolchain by a translation into a formally defined 
language, called CoLiS (Section 3.1). Finally, requirements concerning the
behaviour of scripts include the usage of exit codes and the idempotency of scripts.
The last property is difficult to formalise since it refers to possible unforeseen
failures (see discussion in Section 4.4). Checking behavioural properties requires
reasoning about the semantics of scripts, which is done by a symbolic execution in
our toolchain (Section 3.4). We also check some requirements that are simply
common sense and that are not stated in the policy, e.g., invoking Unix commands
with correct options. This is done by the semantic analysis (Section 3.1).


[Flowchart: preinst install; on failure, postrm abort-install; on success, files
are unpacked, then postinst configure. Final statuses: "Installed",
"Failed-Config", "Not Installed", "Half Installed"; failures exit with an error
message.]

Fig. 2. Debian flowchart for installing a package [4, Appendix 9] (The states represent
calls to maintainer scripts with their arguments and the status returned by dpkg at
the end of the process is in bold.)


2.3 Principles and workflow of the analysis method 


Our goal is to check the above properties of maintainer scripts in a formal way, 
by analysing each script and the composition of scripts in the execution paths 
exhibited by the flowcharts of dpkg. We call scenario either an execution path 
of dpkg, a single execution of a script, or a double execution of a script with the 
same parameters (to check idempotency); refer to Section 3.5 for more details. 

The analysis should consider a variety of states for the system on which the
execution takes place. Yet we assume the following hypotheses: the scripts are
executed in a root process without concurrency with other user or root processes,
the static content of the package is successfully unpacked, the dependencies
defined by the package are present (a fact checked by dpkg), and the /bin/sh
command implements the standard POSIX.1-2017 Shell Command Language with
the additional features described in the Debian policy [4, Chapter 10].

The components of our toolchain for the analysis of a scenario are summarised
in Fig. 3 and detailed in Section 3. Given a package and one scenario, the scenario
player extracts the static content and the maintainer scripts, prepares the initial
symbolic state of the scenario, symbolically executes the steps of the scenario to
compute a symbolic relation between the input and the output states of the file
system for each outcome of the scenario, and produces a diagnosis.


[Diagram; recoverable labels: Scenario Player, Static Contents, Symbolic Engine,
Symbolic Relations]

Fig. 3. Toolchain for analysis of a scenario on a given package (see Section 2.3)


2.4 Presentation of results 


The results computed by the scenario player are presented in a set of web
pages, one per scenario, and a summary page for the package [34]. Each scenario
may have several computed exit codes; for an error code, the associated symbolic
relation is translated automatically into a diagnosis message.

For example, consider the simple scenario of a call to the script preinst given
in Fig. 1. The result web page includes the diagram in Fig. 4, which is obtained
by the interpretation of the symbolic relation computed by the scenario player
for the error exit code. The diagram represents an
abstraction of the initial file system on the left, an abstraction of the file system
at the end of the script’s execution on the right, and the relation between these
abstractions (dotted lines). In this diagram, a plain edge represents the parent
relation in the file hierarchy. A dotted edge describes a similarity relation, e.g.,
the trees rooted at /etc coincide except on the child named rancid. The symbol ⊥
denotes the absence of a node. Finally, a leaf can be annotated by a property, e.g.,
the annotation dir rooted at /etc/rancid/apache.conf. The diagram shows that the
preinst script leads to an error state when the file /etc/rancid/apache.conf
is a directory since the rm command cannot remove directories.


[Diagram relating the initial file system (left) to the final file system (right);
recoverable node labels: etc, rancid, lg.conf, apache.conf; annotations: symlink, dir]

Fig. 4. Example of diagnosis: error case for preinst call in the package rancid-cgi


Finally, another set of generated web pages provides statistics on the coverage
and the errors found for the full set of scenarios of the Debian distribution.


3 Design and implementation of the tool chain 


The toolchain, as described in Fig. 3, hinges on a symbolic execution engine
which computes the overall effect of a script on the file system as a symbolic 
relation between the input and the output file system. This section details this 
execution engine, which is composed of (i) a front-end that parses the script 
and translates it into a script in a formally defined intermediate language called 
CoLiS, and (ii) a back-end that symbolically executes the CoLiS scripts to get, for 
each outcome of the script, the relation between input and output file systems 
encoded by a tree constraint. 


3.1 Front-end 


Shell parser. The syntax of the POSIX shell language is unconventional in many
aspects. For this reason, the implementation of a parser for POSIX shell cannot
simply reuse the standard techniques solely based on code generators. Most
shell implementations fall back on manually written character-level parsers,
which are difficult to maintain and to trust. morbig [30] is a parser that tries to
use code generators as much as possible to keep the parser implementation at a
high level of abstraction, simplifying maintenance and improving our ability to
check if it complies with the POSIX standard.


The CoLiS language. It was first presented in 2017 [23]. Its design aims to avoid 
some pitfalls of the shell, and to make explicit the dangerous constructions we 
cannot eliminate. It has a clear syntax and a formally defined semantics. We 
provide an automated and direct translation from POSIX shell. The correctness
of the translation from shell to CoLiS cannot be proven formally but must be 
trusted based on manual review of translations and tests. 

For this case study, we improved the language proposed formerly [23] to 
increase the number of analysed Debian maintainer scripts. First, we added a 
number of constructs to the language. Second, we provide a formal semantics for 
the new constructs and we align the previous semantics [23] with that of the 
POSIX shell for a few other constructs. These changes and a complete description 
of the current CoLiS language are given in a technical report [6]. Fig. 5 
shows the CoLiS version of the preinst script of the rancid-cgi package, shown 
previously in Fig. 1. Notice the syntax for string arguments and for lists of 
arguments that requires mandatory usage of delimiters. Generally speaking, the 
syntax of CoLiS is designed so as to remove potential ambiguities [6]. 

    if test [ '-h'; '/etc/rancid/lg.conf' ] then 
      rm [ '/etc/rancid/lg.conf' ] 
    fi 
    if test [ '-e'; '/etc/rancid/apache.conf' ] then 
      rm [ '/etc/rancid/apache.conf' ] 
    fi 

Fig. 5. preinst script of the rancid-cgi package in CoLiS 


The toolchain for analysing CoLiS scripts is designed with formal verification 
in mind: the syntax, semantics, and interpreters of CoLiS are implemented using 
the Why3 environment [7] for formal verification. More precisely, the syntax 
of CoLiS is defined abstractly (as abstract syntax trees, AST for short) by an 
algebraic datatype in Why3. Then CoLiS semantics is defined by a set of inductive 
predicates [6] that encodes a chiefly standard, big-step operational semantics. 
The semantic rules cover the contents of variables and input/output buffers used 
during the evaluation of a CoLiS script, but they do not specify the contents of 
the file system and the behaviour of POSIX commands. The judgements and rules 
are parameterised by bounds on the number of loop iterations and the number 
of (recursively) nested function calls to allow for formalising the correctness of 
the symbolic interpreter. Each bound is either a non-negative integer or ∞ (for 
unbounded execution) and remains constant throughout the evaluation of a CoLiS 
instruction. We refer to [6] for the details. 


A concrete interpreter for the CoLiS language is implemented in Why3. Its 
formal specifications (preconditions and post-conditions) state the soundness of 
the interpreter, i.e., that any result corresponds to the formal semantics with 
unbounded number of loop iterations and unbounded nested function calls. The 
specifications are checked using automated theorem provers [23]. 


Translation from shell to CoLiS. This is done automatically, but it is not formally 
proven. Indeed, a formal semantics of shell was missing until very recently [21]. 
For the control flow constructs, the AST of the shell script is translated into the 
AST of CoLiS. For the strings (words in shell), the translation generates either a 
string CoLiS expression or a list of CoLiS expressions depending on the content of 
the shell string. This translation makes explicit the string evaluation in shell, in 
particular the implicit string splitting. At the present time, the translator rejects 
23% of shell scripts because it does not cover all constructs of the shell, e.g., 
usage of globs, variables with parameters, and advanced uses of redirections. 
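
The implicit string splitting mentioned above can be illustrated with a small Python sketch. It is not the CoLiS translator: the fragment classes `Lit` and `Var` and the function `expand_word` are hypothetical names introduced here, and the model assumes the default field separators (whitespace) and ignores globs and all other expansions. A word made only of literals and quoted expansions yields exactly one argument, whereas an unquoted expansion may yield zero, one, or several arguments.

```python
from dataclasses import dataclass

@dataclass
class Lit:
    """Literal text inside a shell word; never split."""
    text: str

@dataclass
class Var:
    """Variable expansion; split on whitespace only when unquoted."""
    name: str
    quoted: bool

def expand_word(fragments, env):
    """Return the list of arguments that one shell word expands to."""
    fields = []      # completed arguments
    current = None   # argument under construction, None if none is open
    for frag in fragments:
        if isinstance(frag, Lit):
            current = (current or "") + frag.text
        elif frag.quoted:
            current = (current or "") + env.get(frag.name, "")
        else:
            value = env.get(frag.name, "")
            # Leading whitespace closes the argument built so far.
            if value[:1].isspace() and current is not None:
                fields.append(current)
                current = None
            for i, part in enumerate(value.split()):
                if i > 0:
                    fields.append(current)
                    current = None
                current = (current or "") + part
            # Trailing whitespace closes the last argument.
            if value[-1:].isspace() and current is not None:
                fields.append(current)
                current = None
    if current is not None:
        fields.append(current)
    return fields

env = {"files": "a.conf b.conf", "empty": ""}
quoted = expand_word([Var("files", quoted=True)], env)     # one string
unquoted = expand_word([Var("files", quoted=False)], env)  # a list of strings
```

In this model the quoted word corresponds to a single string expression and the unquoted word to a list of expressions, which is the distinction the translation makes explicit.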


The conformance of the CoLiS script with the original shell script is not 
proven formally but tested by manual review and some automatic tests. For the 
latter, we developed a tool that automatically compares the results of the CoLiS 
interpreter on the CoLiS script with the results of the Debian default shell (dash) 
on the original shell script. This tool uses a test suite of shell scripts built to 
cover all constructs of the CoLiS language. The test suite allowed us to fix 
the translator and the formal semantics of CoLiS and, as an additional outcome, 
it revealed a lack of conformance between the Debian default shell and POSIX⁵. 


⁵ https://www.mail-archive.com/dash@vger.kernel.org/msg01683.html 
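
The comparison of two interpreters on a test suite can be sketched as follows. This is a generic differential-testing harness written for illustration, not the tool developed for CoLiS: `differential_test` and `run_shell` are hypothetical names, each runner is assumed to return a pair (standard output, exit code), and `run_shell` assumes that the chosen shell (dash by default) is installed.

```python
import subprocess

def run_shell(script, shell="dash"):
    """Run `script` with a real shell and return (stdout, exit code)."""
    done = subprocess.run([shell, "-c", script], capture_output=True, text=True)
    return done.stdout, done.returncode

def differential_test(scripts, reference, candidate):
    """Return the scripts on which the two runners disagree."""
    mismatches = []
    for script in scripts:
        expected, actual = reference(script), candidate(script)
        if expected != actual:
            mismatches.append((script, expected, actual))
    return mismatches
```

A typical use would pass `run_shell` as the reference and a function that translates and interprets the script as the candidate; every reported mismatch then points either to a defect in the candidate or to a non-conformance of the reference.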

Fig. 6. Examples of feature trees showing directories (t1), sub-directories (t2), a regular 
file and a symbolic link (t3). 


Fig. 7. Basic constraints, from left to right: a feature, a regular file node, a directory 
node, a tree similarity, a feature absence, a maybe 


3.2 Feature trees and constraints 


We employ models and logics to describe transformations of UNIX file systems. 
Feature trees [32,3,33] turn out to be suitable models for this case study. We have 
proposed a logic suitable to express file system transformations by extending 
previously existing logics. For the sake of space, we provide a concise overview 
of the model and logic used in this case study. 


Feature trees. The models we consider here are trees with features (taken from F, 
an infinite set of legal file names) on the edges, the dir kind on the nodes and 
any kind (dir, reg or symlink) on the leaves. Examples are given in Fig. 6. 
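
A concrete feature tree can be modelled in a few lines of Python. The encoding below is an assumption made for illustration (it is not the representation used in the toolchain): a directory is a dict from features to subtrees, and a leaf of kind reg or symlink is the corresponding string. The example tree is hypothetical and only loosely inspired by the rancid-cgi files.

```python
# A feature tree is modelled as:
#   - a dict mapping features (file names) to subtrees: a node of kind dir
#     (an empty dict is an empty directory, i.e. a leaf of kind dir);
#   - the string "reg" or "symlink": a leaf of that kind.

def kind(tree):
    """Return the kind of the root of `tree`."""
    return "dir" if isinstance(tree, dict) else tree

def subtree(tree, features):
    """Follow a sequence of features from the root; None if the walk fails."""
    for f in features:
        if not isinstance(tree, dict) or f not in tree:
            return None
        tree = tree[f]
    return tree

example = {
    "bin": {},
    "etc": {"rancid": {"apache.conf": "reg", "lg.conf": "symlink"}},
}
```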


Constraints. To specify properties of feature tree models, we adapt our 
first-order logic [26] to the needs of this case study. For the sake of presentation, 
we use a graphical representation of quantifier-free conjunctive clauses of this logic. 
See the technical report [24] for a detailed presentation. 

The core basic constraints are presented in Fig. 7. The feature constraint 
expresses that y is a subtree of x accessible from the root of x via feature f. 
The kind constraints express that the root of a tree has the given kind (dir, reg 
or symlink). The similarity constraint expresses that x and y have the same 
children with the same names except for the children whose names are in F, a 
finite set of features, where they may differ. 
For performance reasons, we added two more constraints; these do not increase 
the expressive power but help to prevent combinatorial explosion of formulas. 
The absence constraint expresses that either x is not a directory or x does 
not have a feature f at its root. The maybe constraint expresses that either x 
is not a directory, or it does not have a feature f at its root, or it has one 
that leads to y. 

Fig. 8. A conjunctive clause 

A model of a formula is a valuation that maps variables to feature trees. 
For instance, consider the valuation that associates t1 to x, t2 to y and t3 to z, 
where t1, t2 and t3 are the trees defined in Fig. 6; it satisfies the formula in Fig. 8. 
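
To make the meaning of the basic constraints concrete, the following Python sketch checks them against a valuation given as concrete trees. The encoding (directories as dicts, leaves as the strings "reg" and "symlink") and all function names are assumptions made for illustration; in particular, treating two non-directories as similar only when they are equal is a choice of this sketch, not something stated in the text.

```python
def is_dir(tree):
    return isinstance(tree, dict)

def feature(x, f, y):
    """x is a directory whose feature f leads to y."""
    return is_dir(x) and f in x and x[f] == y

def absence(x, f):
    """Either x is not a directory or x has no feature f at its root."""
    return not is_dir(x) or f not in x

def maybe(x, f, y):
    """Absence of f in x, or f leads to y."""
    return absence(x, f) or x[f] == y

def similar(x, names, y):
    """x and y have the same children, except possibly under `names`."""
    if not (is_dir(x) and is_dir(y)):
        return x == y  # assumption of this sketch for non-directories
    others = (set(x) | set(y)) - set(names)
    return all(f in x and f in y and x[f] == y[f] for f in others)

x = {"bin": {}, "etc": {"passwd": "reg"}}
y = {"bin": {}, "etc": {}}
```

With these hypothetical trees, `similar(x, {"etc"}, y)` holds because x and y differ only below etc, while `similar(x, set(), y)` does not.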


Satisfiability. We designed a set of transformation rules [26] that turns any 
Σ1-formula into an irreducible form that is either false or a satisfiable formula. 
This is convenient in our setting because we can detect unsatisfiable formulas as 
soon as possible and keep the irreducible form instead of the original formula, 
speeding up further computations. Our toolchain includes an implementation 
of this system, using an efficient representation of irreducible Σ1-formulas as 
trees themselves. Finally, the system of rules is also extended to a quantifier 
elimination procedure, showing that the whole first-order logic is decidable. 


3.3 Specifications of UNIX commands 


The specification of the UNIX commands uses our feature tree logic to express 
their effect on the file system. The specification formalises the description given 
in natural language in the POSIX standard [22, Chapter Utilities] and, for some 
commands, in GNU manual pages. We only specified (most of) the UNIX 
commands called by the maintainer scripts. 

The full specification is available in a separate technical report [24]. We 
present here its main ingredients. A UNIX command has the form “cmd options 
paths”, where “cmd” is a command name, “options” is a list of options, and 
“paths” is one or more absolute or relative paths (i.e., sequences of file names 
and the symbols “.” and “..”). For each combination of command name and option, 
we provide a list of formulas specifying the success and failure cases. A success or 
failure case formula has two free variables r and r’, which represent the root of 
the file system before and after the command execution. For some combinations 
of command names and options, the specification is not provided, but computed 
by the symbolic execution of a CoLiS script. This script captures the command 
behaviour by calling other (primitive) commands. 


Fig. 9. Specification of success case for rm /etc/rancid/lg.conf 

Fig. 10. Specification of error cases of rm /etc/rancid/lg.conf: explicit cases on the 
left, compact specification on the right 


Path resolution. An important ingredient in command specification is the 
constraint encoding the resolution of a path in the file system. For this, we define 
a predicate resolve(r, cwd, p, z) stating that “when the root of the file system 
is r and the current working directory is the sequence of features cwd, the path 
p resolves and goes to variable z”. The constraint defining this predicate is a Σ1 
conjunction of basic constraints; it does not deal with symbolic link files on the 
path. For example, the constraint resolve(r, cwd, /etc/rancid/lg.conf, z) is 
represented by the path starting from r and ending in z in Fig. 9. 
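
The concrete counterpart of the resolve predicate can be sketched in Python on an explicit tree. This is an illustration under assumptions, not the constraint-based definition of the specification: directories are dicts, leaves are the strings "reg" and "symlink", the function name `resolve` and its return convention (the reached node, or None on failure) are choices of this sketch, and "..” at the root is assumed to stay at the root. As in the text, symbolic links are not followed.

```python
def resolve(root, cwd, path):
    """Return the node that `path` resolves to, or None if resolution fails.

    `cwd` is the current working directory as a list of features; it is
    ignored when `path` is absolute.
    """
    components = [c for c in path.split("/") if c]
    start = [] if path.startswith("/") else list(cwd)
    stack = [root]                      # nodes from the root to the current node
    for name in start + components:
        node = stack[-1]
        if not isinstance(node, dict):  # only directories can be traversed
            return None
        if name == ".":
            continue
        if name == "..":
            if len(stack) > 1:          # ".." at the root stays at the root
                stack.pop()
            continue
        if name not in node:
            return None
        stack.append(node[name])
    return stack[-1]

r = {"etc": {"rancid": {"lg.conf": "symlink"}}}
z = resolve(r, [], "/etc/rancid/lg.conf")
```

The loop visits one feature per path component, which mirrors the chain of feature constraints from r to z in the graphical representation.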

For some commands, a failure of path resolution may cause the failure of 
the command. To specify these failure cases, we have to use the negation of the 
predicate resolve, which generates a number of clauses which is linear in the 
length of the resolved path. Fig. 10 shows, in the three left-most constraints, 
the error cases for the resolution of the path to /etc/rancid/lg.conf. Because 
the internal representation of formulas keeps only conjunctive clauses, this may 
produce a state explosion of constraints when the command uses several paths. 
To obtain a compact internal representation of these error cases, we employ the 
maybe shorthand, as shown on the right of Fig. 10. 

Let us consider the command rm /etc/rancid/lg.conf. Its specification 
includes one success case, given in Fig. 9: the resolution of the path 
/etc/rancid/lg.conf succeeds in the initial file system, denoted by r, and 
the resulting file system, denoted by r’, is similar to r except for the absence 
of the feature lg.conf. The specification also includes one error case, given in 
Fig. 10, where the path cannot be resolved to a regular path, and therefore the 
initial and final file systems are the same. 
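
The two cases of this specification have a direct concrete reading, sketched below in Python on explicit trees (directories as dicts, leaves as the strings "reg" and "symlink"). The function `rm` and its return convention are hypothetical, the path is assumed to be absolute, non-empty and already split into features, and treating a directory target as an error is an assumption of this sketch about plain rm without options.

```python
import copy

def rm(root, features):
    """Concrete reading of the success and error cases of `rm path`.

    `features` is the path as a list of names, for example
    ["etc", "rancid", "lg.conf"]. Returns (succeeded, new_root); the input
    tree is never modified.
    """
    parent = root
    for name in features[:-1]:
        if not isinstance(parent, dict) or name not in parent:
            return False, root          # error case: r' is the same as r
        parent = parent[name]
    last = features[-1]
    if (not isinstance(parent, dict) or last not in parent
            or isinstance(parent[last], dict)):
        return False, root              # unresolved path, or a directory
    new_root = copy.deepcopy(root)
    node = new_root
    for name in features[:-1]:
        node = node[name]
    del node[last]                      # r' is similar to r except for `last`
    return True, new_root

r = {"etc": {"rancid": {"lg.conf": "symlink", "apache.conf": "reg"}}}
ok, r_prime = rm(r, ["etc", "rancid", "lg.conf"])
```

The symbolic specification describes the same relation between r and r’ without knowing the concrete tree: the success case adds the resolution and similarity constraints, the error case states that nothing changes.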

It is important to notice that specifications of commands are parameterised 
by their path argument(s): for each concrete value of such paths, an appropriate 
constraint is produced. This fact is essential for using our symbolic engine, 
because the variables of a constraint denote nodes of the file system, but there 
is no notion of variable denoting file names or paths. 


3.4 Analysis by symbolic execution 


With a similar approach as for the concrete interpreter (Section 3.1), we designed 
and implemented a symbolic interpreter for the CoLiS language in Why3. Guided 
by a proof-of-concept symbolic interpreter for a simple IMP language [5], the 
main design choices for the symbolic interpreter for CoLiS are: 


— Variables are not interpreted abstractly: when executing an installation 
script, the concrete values of the variables are known. On the other hand, 
the state of the file system is not known precisely, and it is represented 
symbolically using a feature tree constraint. 

— The symbolic engine is generic with respect to the utilities: their specifica- 
tions in terms of symbolic input/output relations are taken as parameters. 

— The number of loop iterations and the number of (recursively) nested 
function calls [6] are bounded a priori; the bound is given by a global 
parameter set at the interpreter call. 


The Why3 code for the symbolic interpreter is annotated with post-conditions to 
express that it computes an over-approximation [5] of the concrete states that are 
reachable without exceeding the given bound on loop iterations. This property 
is formally proven using automated provers. The OCaml code is automatically 
extracted from Why3, and provides an executable symbolic interpreter with 
strong guarantees of soundness with respect to the concrete formal semantics. 

Notice that our symbolic engine supports neither parallel executions nor file 
permissions or file timestamps. This is another source of over-approximation, 
but also of under-approximation, meaning that our approach can miss bugs whose 
triggering relies on these features. 

The symbolic interpreter provides a symbolic semantics for the given script: 
given an initial symbolic state that represents the possible initial shape of the file 
system, it returns a triple of sets of symbolic input/output relations, respectively 
for normal result, error result (corresponding to non-zero exit code) and result 
when a loop limit is reached. Error results are unexpected for Debian maintainer 
scripts, and these cases have to be inspected manually. To help this inspection, a 
visualisation of symbolic relations was designed, as already shown in Fig. 4. 
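
The shape of this result (three collections of final states) and the role of the loop bound can be illustrated with a toy interpreter in Python. Everything in it is an assumption made for illustration and far simpler than the verified Why3 development: programs are nested tuples, a utility specification is a function from a state to a list of (success, new state) cases, states are opaque values, an error aborts a sequence (as under the -e mode), and a failing loop condition ends the loop normally.

```python
def run(prog, state, bound):
    """Return (normal, error, incomplete): three lists of final states."""
    kind = prog[0]
    if kind == "call":                       # ("call", spec)
        normal, error = [], []
        for ok, new_state in prog[1](state):
            (normal if ok else error).append(new_state)
        return normal, error, []
    if kind == "seq":                        # ("seq", first, second)
        normal, error, incomplete = [], [], []
        first_ok, first_err, first_inc = run(prog[1], state, bound)
        error += first_err                   # an error aborts the sequence
        incomplete += first_inc
        for s in first_ok:
            ok, err, inc = run(prog[2], s, bound)
            normal += ok
            error += err
            incomplete += inc
        return normal, error, incomplete
    if kind == "while":                      # ("while", condition, body)
        normal, error, incomplete = [], [], []
        frontier = [state]
        for iteration in range(bound + 1):
            next_frontier = []
            for s in frontier:
                cond_ok, cond_fail, cond_inc = run(prog[1], s, bound)
                normal += cond_fail          # condition false: the loop exits
                incomplete += cond_inc
                for s2 in cond_ok:
                    if iteration == bound:   # one more iteration would be needed
                        incomplete.append(s2)
                        continue
                    ok, err, inc = run(prog[2], s2, bound)
                    next_frontier += ok
                    error += err
                    incomplete += inc
            frontier = next_frontier
        return normal, error, incomplete
    raise ValueError(f"unknown instruction: {kind!r}")

def rm_lg(state):
    """Hypothetical two-case specification; a state is a tuple of facts."""
    return [(True, state + ("lg.conf resolves; removed",)),
            (False, state + ("lg.conf does not resolve",))]

normal, error, incomplete = run(("call", rm_lg), (), bound=10)
```

Because the specifications are passed in as ordinary functions, the interpreter itself knows nothing about individual utilities, which is the genericity property listed among the design choices above.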


3.5 Scenarios 


So far, we have presented how we analyse individual maintainer scripts. In reality, 
the Debian policy specifies in natural language in which order and with which 
arguments these scripts are invoked during package installation, upgrade, or 
removal (see, for instance, Fig. 2). We have specified these scenarios in a loop-free 
custom language. These scenarios define what happens after the success or 
the failure of a script execution. They also specify when the static content is 
unpacked. Furthermore, our toolchain allows defining the assumptions that can 
be made on an initial file system before executing a scenario, for instance the 
File System Hierarchy Standard [38]. Our toolchain reports on packages that 
may remain in an unexpected state after the execution of one of these scenarios. 

For instance, the installation scenario of the package rancid-cgi may leave 
that package in the state not-installed, which is reported by our toolchain using 
the diagram in Fig. 4. 
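
A loop-free scenario is essentially a decision tree over script outcomes. The Python sketch below encodes a simplified installation scenario in that style; the tuple encoding, the function `execute`, and the exact branches and final state names are assumptions made for illustration and do not reproduce the scenario language of the toolchain or the full Debian policy.

```python
# A scenario node is one of:
#   ("run", script, argument, on_success, on_failure)
#   ("unpack", next_node)
#   ("end", final_state)
INSTALL = (
    "run", "preinst", "install",
    ("unpack",
     ("run", "postinst", "configure",
      ("end", "installed"),
      ("end", "half-configured"))),
    ("run", "postrm", "abort-install",
     ("end", "not-installed"),
     ("end", "half-installed")),
)

def execute(scenario, scripts, unpack):
    """Walk the scenario tree; `scripts` maps a name to a function that
    takes the argument and returns True on success."""
    node = scenario
    while node[0] != "end":
        if node[0] == "unpack":
            unpack()
            node = node[1]
        else:
            _, name, argument, on_success, on_failure = node
            node = on_success if scripts[name](argument) else on_failure
    return node[1]

# Hypothetical package whose preinst fails while postrm succeeds.
failing_preinst = {
    "preinst": lambda argument: False,
    "postinst": lambda argument: True,
    "postrm": lambda argument: True,
}
final_state = execute(INSTALL, failing_preinst, unpack=lambda: None)
```

In the symbolic setting each script call yields several outcomes at once, so all branches of the tree that are reachable under some constraint are explored rather than a single path.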


4 Results and impact 


4.1 Coverage of the case study 


The tools used and the datasets analysed during the current study are available 
in the Zenodo repository [36]. 

We execute the analysis on a machine equipped with 40 hyperthreaded Intel 
Xeon CPUs at 2.20 GHz and 750 GB of RAM. To obtain a reasonable execution 
time, we limit the processing of one script to 60 seconds and 8 GB of RAM. 
The time limit might seem low, but experience shows that the few scripts 
(in 30 packages) that exceed this limit actually require hours of processing 
because they make heavy use of dpkg-maintscript-helper. On our corpus of 
12 592 packages with 28 814 scripts, the analysis runs in about half an hour. 

All of those scripts that are syntactically correct with respect to the POSIX 
standard (99.9%) are parsed successfully by our parser. The translation of the 
parsed scripts into our intermediary language CoLiS succeeds for 77% of them; 
the translation fails mainly because of the use of globs, variables with parameters 
and advanced uses of redirections. 

Our toolchain then attempts to run 113 328 scenarios (12 592 packages with 
scripts, 9 scenarios per package). Out of those, 45 456 scenarios (40%) are run 
completely and 13 149 (12%) partially. This is because scenarios have several 
branches and although a branch might encounter failure, we try to get some 
information on execution of other branches. For the same reason, one scenario 
might encounter several failures. In total, we encounter 67 873 failures. The 
origins of failures are multiple, but the two main ones are (i) trying to execute 
a scenario that includes a script that we cannot convert (28% of failures), or 
(ii) the scripts might use commands unsupported by our tools, or unsupported 
features of supported commands (71% of failures). 
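
The scenario counts and percentages above can be rechecked with a few lines of Python; the variable names are introduced here for illustration and the figures are those quoted in the text.

```python
packages_with_scripts = 12592
scenarios_per_package = 9
complete, partial = 45456, 13149

scenarios = packages_with_scripts * scenarios_per_package
pct_complete = round(100 * complete / scenarios)
pct_partial = round(100 * partial / scenarios)

print(scenarios, pct_complete, pct_partial)
```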

Among the scenarios that we manage to execute at least partially, 19 reach 
an unexpected end state. These are potential bugs. We have examined them 
manually to remove false positives due to approximations made by our 
methodology or the toolchain. We discuss in Section 4.3 the main classes of true 
bugs revealed by this process. 


4.2 Corpus mining 


The latest version of the Debian sid distribution on which we ran our tools dates 
from October 6, 2019. It contains 60 000 packages, 12 592 of which contain at 
least one maintainer script, which leads to 28 814 scripts. In total, these scripts 
contain 442 364 source lines of code, 15 lines on average, and up to 1138 for the 
largest script. Among them we find 220 bash scripts, 2 dash scripts, 14 perl 
scripts, and one ELF executable; the rest are POSIX shell scripts. 

Table 1. Bugs found between 2016 and 2019 in Debian sid distributions 

Bugs  Closed  Detected by         Reports     Examples 
  95      56  parser              [9]         not using -e mode 
   6       4  parser & manual     [15]        unsafe or non-POSIX constructs 
  34      24  corpus mining       [8,10]      wrong options, mixed redirections 
   9       7  translation         [11]        wrong test expressions 
   5       2  symbolic execution  [13,17,15]  try to remove a directory with rm 
   3       3  formalisation       [12]        bug in dpkg-maintscript-helper 
----  ------ 
 152      96 

In the process of designing our tools, and in order to validate our hypotheses, 
we ran statistical analysis on this corpus of scripts. The construction of our tool 
for statistical analysis is described in a technical report [25] where we also detail 
a few of our findings. To summarise, analysing the corpus revealed that: 


— Most variables in scripts were used as constants: only 3008 scripts contain 
variables whose value actually changes. 

— There are no recursive functions in the whole corpus. 

— There are 2300 scripts that include a while loop. 93% of the while loops 
occur in a pipe reading the output of dpkg -L and are an idiosyncrasy that 
is proper to some shell languages. They can be translated to “foreach” loops 
in a properly typed language. 

— The vast majority of redirections are used to hide the standard output or 
merge it into the error output. 


This analysis had an important impact on the project by guiding the design 
choices of CoLiS as well as the selection of the Unix commands to specify and 
the order in which to specify them. It also helped us to discover a few bugs, 
e.g., scripts invoking Unix commands with invalid options. 


4.3 Bugs found 


We ran our toolchain on several snapshots of the Debian sid distribution taken 
between 2016 and 2019, the latest one being October 6, 2019. We reported over 
this period a total of 152 bugs to the Debian Bug Tracking System [37]. Some of 
them have immediately been confirmed by the package maintainer (for instance, 
[16]), and 96 of them have already been resolved. 

Table 1 summarises the main categories of bugs we reported. Simple lexical 
analysis already detects 95 violations of the Debian Policy, for instance scripts 
that do not specify the interpreter to be used, or that do not use the -e mode [9]. 
The shell parser (Section 3.1) detects 3 scripts that use shell constructs not 
allowed by the POSIX standard, or in a context where the POSIX standard states 
that the behaviour is undefined [15]. There are also 3 miscellaneous bugs, like 
using unsafe shell constructs. The mining tool (Section 4.2) detects 5 scripts that 
invoke Unix commands with wrong options and 29 scripts that mix up redirection 
of standard-output and standard-error. The translation from the shell to the 
CoLiS language (Section 3.1) detects 9 scripts with wrong test expressions [11]. 
These may stay unnoticed during superficial testing since the shell confuses, when 
evaluating the condition of an if-then-else, an error exception with the Boolean 
value False. Inspection of the symbolic semantics extracted by the symbolic 
execution (Section 3.4) finds 5 scripts with semantic errors. Among these is the 
bug [16] of the package rancid-cgi already explained in Section 2.4. During the 
formalisation of Debian tools (see Section 3.3), we found 3 bugs. These include in 
particular a bug [12] in the dpkg-maintscript-helper command, which is used 
10 306 times in our corpus of maintainer scripts; this bug has since been fixed. 


4.4 Lessons learnt 


One basic problem when trying to analyse maintainer scripts is to understand 
precisely the meaning of the policy document. For instance, one of the more 
intriguing requirements is that maintainer scripts have to be idempotent 
(Section 6.2 in [4]). While it is common knowledge that a mathematical function f 
is idempotent when f(f(x)) = f(x) for any x, the meaning is much less clear 
in the context of Debian maintainer scripts as the policy goes on to explain “If 
the first call failed, or aborted half way through for some reason, the second 
call should merely do the things that were left undone the first time, if any, and 
exit with a success status if everything is OK.” We suppose that this refers to 
causes of error external to the script itself (power failure, full disk, etc.), and 
that there might be an intervention by the system administrator between the 
two invocations. Since we cannot even explain in natural language what precisely 
that means, let alone formalise it, we decided to model at the moment only a 
rough under-approximation of that property that only compares executions by 
their exit code. This allowed us to detect a bug [14]. 
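
One possible reading of such an exit-code-only check is sketched below in Python. It is an assumption made for illustration, not the property implemented in the toolchain: a script is modelled as a function from a file-system state to a pair (exit code, new state), and the check only flags a script that succeeds once and then fails when run again on the resulting state. The two example scripts and the path /etc/demo.conf are hypothetical.

```python
def exit_code_idempotent(script, state):
    """Coarse check: a successful run followed by a second run must succeed."""
    first_code, after_first = script(state)
    if first_code != 0:
        return True   # first run failed: this coarse check says nothing
    second_code, _ = script(after_first)
    return second_code == 0

def rm_strict(files):
    """Like `rm /etc/demo.conf`: fails when the file is missing."""
    if "/etc/demo.conf" not in files:
        return 1, files
    return 0, files - {"/etc/demo.conf"}

def rm_force(files):
    """Like `rm -f /etc/demo.conf`: succeeds even when the file is missing."""
    return 0, files - {"/etc/demo.conf"}

initial = frozenset({"/etc/demo.conf"})
strict_ok = exit_code_idempotent(rm_strict, initial)
force_ok = exit_code_idempotent(rm_force, initial)
```

The check compares nothing but exit codes, so it can miss scripts whose second run succeeds while leaving a different file system behind.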


We found that identifying bugs in maintainer scripts always requires human 
examination. Automated tools can point out potential problems in a large 
corpus, but deciding whether such a problem actually deserves a bug report, 
and of what severity level, requires some experience with the Debian processes. 
This is most visible with semantic bugs in scripts, since an error exit code does 
not imply that there is a bug. Indeed, if a script detects a situation it cannot 
handle then it must signal an error and produce a useful error message. Deciding 
whether a detected error case is justified or accidental requires human judgement. 

Filing bug reports demands some caution, and observance of rules and 
common practices in the community. For instance, the Debian Developers 
Reference [18] requires approval by the community before so-called mass bug 
filing. Consequently, we always sought advice before sending batches of bugs, 
either on the Debian developers mailing list, or during Debian conferences. 
5 Conclusion 


The corpus of Debian maintainer scripts is an interesting case study for analysis 
due to its size, the challenging features of the scripting language, and the 
relational properties it requires to analyse. The results are very promising. First, 
we reported 152 bugs [37] to the Debian Bug Tracking System, 96 of which have 
already been resolved by Debian maintainers. Second, the toolchain performs 
the analysis of a package in seconds and of the full distribution in less than an 
hour, which makes it fit for integration in the workflow of Debian maintainers 
or for quality assurance at the level of the whole distribution. Integration of 
our toolchain in the lintian tool will not be possible since it would add a lot of 
external dependencies to that tool, and since the reports generated by our tool 
still require human evaluation (see Section 4.4). 

This study had several additional outcomes. The toolchain includes tools for 
parsing and light static analysis of shell scripts [30], an engine for the symbolic 
execution of imperative languages based on a first-order logic representation of 
program configurations [5], and an efficient decision procedure for feature tree 
logics. We also provide a formal specification of POSIX commands used in Debian 
scripts in terms of a first-order logic [24]. 

We are not aware of a project dealing with this kind of problem or obtaining 
comparable results. To our knowledge, the only existing attempt to analyse a 
complete corpus of package maintainer scripts was done in the context of the 
Mancoosi project [19]. In this work, the analysis, mainly syntactic, resulted in a 
set of building blocks used in maintainer scripts that may be used in a DSL. In a 
series of papers [20,28,29], Ntzik et al. consider formal reasoning about POSIX 
scripts manipulating the file system based on (concurrent) separation logic. Not 
only do they employ a different logic (a second-order logic), but they also focus 
on (manual) proof techniques for correctness and not on automatic techniques for 
finding bugs. Moreover, they consider general scripts and properties that are not 
relational (like idempotency). There have been few attempts to formalise the 
shell. Greenberg and Blatt [21] recently proposed an executable formal semantics 
of the POSIX shell that will serve as a foundation for shell analysis tools. 
Abash [27] contains 
a formalisation of parts of the bash language and an abstract interpretation tool 
for the analysis of arguments passed by scripts to Unix commands; this work 
focused on identifying security vulnerabilities. 

The successful outcome of this case study revealed new challenges that we 
aim to address in future work. In order to increase the coverage of our analysis 
and the acceptance by Debian maintainers, the translation from shell should 
cover more features, additional Unix commands should be formally specified, 
and the model should capture more features of the file system, e.g., permissions, 
or symbolic links. The efficiency of the analysis can still be improved by using a 
more compact representation of disjunctive constraints in feature tree logics or by 
exploiting the genericity of the symbolic execution engine to include other logic 
based symbolic representations that may be more efficient and precise. Finally, 
we want to use the computed constraints on scenarios to check new properties 
of scripts like equivalence of behaviours. 




References 


10. 


11. 


12. 


13. 


14. 


15. 


16. 


1T. 


18. 


Lintian. https://lintian.debian.org 


. Piuparts. https:/ /piuparts.debian.org/ 


Ait-Kaci, H., Podelski, A., Smolka, G.: A feature-based constraint system for logic 
programming with entailment. Theor. Comput. Sci. 122(1-2), 263-283 (Jan 1994) 
Allbery, R., Whitton, S.: Debian policy manual (Oct 2019), https:/ /www.debian. 
org/doc/debian-policy / 

Becker, B., Marché, C.: Ghost Code in Action: Automated Verification of a 
Symbolic Interpreter. In: Chakraborty, S., A.Navas, J. (eds.) Verified Software: 
Tools, Techniques and Experiments. Lecture Notes in Computer Science (2019), 
https:/ /hal.inria.fr/hal-02276257 

Becker, B., Marché, C., Jeannerod, N., Treinen, R.: Revision 2 of CoLiS language: 
formal syntax, semantics, concrete and symbolic interpreters. Technical report, 
HAL Archives Ouvertes (Oct 2019), https://hal.inria.fr/hal-02321743 

Bobot, F., Filliátre, J.C., Marché, C., Paskevich, A.: Let's verify this with Why3. 
International Journal on Software Tools for Technology Transfer (STTT) 17(6), 
709-727 (2015). https://doi.org/10.1007/s10009-014-0314-5, http:/ /hal.inria.fr/ 
hal-00967132/en, see also http:/ /toccata.lri.fr/gallery /fm2012comp.en.html 
Debian Bug Tracker: dibbler-server: postinst contains invalid command. Debian 
Bug Reports 841934 (Oct 2016), https:/ /bugs.debian.org/cgi- bin/bugreport.cgi? 
bug=841934 

Debian Bug Tracker: authbind: maintainer script(s) not using strict mode. De- 
bian Bug Report 866249 (Jun 2017), https: //bugs.debian.org/cgi-bin/bugreport. 
cgi? bug=866249 

Debian Bug Tracker: dict-freedict-all: postinst script has a wrong redirection. De- 
bian Bug Report 908189 (Sep 2018), https: //bugs.debian.org/cgi- bin/bugreport. 
cgi? bug=908189 

Debian Bug Tracker: python3-neutron-fwaas-dashboard: incorrect test in postrm. 
Debian Bug Report 900493 (May 2018), https://bugs.debian.org/cgi-bin/ 
bugreport.cgi? bug=900493 

Debian Bug Tracker: [dpkg-maintscript-helper| bug in finish. dir to symlink. De- 
bian Bug Report 922799 (Feb 2019), https:/ /bugs.debian.org/cgi- bin/bugreport. 
cgi?bug—922799 

Debian Bug Tracker: ndiswrapper: when "postrm purge" fails it may have deleted 
some config files. Debian Bug Report 942392 (Oct 2019), https:/ /bugs.debian.org/ 
cgi- bin/bugreport.cgi?bug-— 942392 

Debian Bug Tracker: oz: non-idempotent postrm script. Debian Bug Report 942395 
(Oct 2019), https:/ /bugs.debian.org/cgi- bin/bugreport.cgi?bug=942395 

Debian Bug Tracker: preinst script not posix compliant. Debian Bug Report 925006 
(Mar 2019), https: //bugs.debian.org/cgi-bin /bugreport.cgi? bug=925006 

Debian Bug Tracker: rancid-cgi: preinst may fail and not rollback a change. De- 
bian Bug Report 942388 (Oct 2019), https: //bugs.debian.org/cgi-bin/bugreport. 
cgi? bug=942388 

Debian Bug Tracker: sgml-base: preinst may fail *silently*. Debian Bug Report 
929706 (May 2019), https: //bugs.debian.org/cgi- bin/bugreport.cgi?bug=929706 
Developer’s Reference Team: Debian developers reference (Oct 2019), https:// 
www.debian.org/doc/manuals/developers-reference/ 


252 


19. 


20. 


21. 


22. 


23. 


24. 


25. 


26. 


27. 


28. 


29. 


30. 


31. 


32. 


33. 


34. 


35. 


B. Becker et al. 


Di Cosmo, R., Di Ruscio, D., Pelliccione, P., Pierantonio, A., Zacchi- 
roli, S.: Supporting software evolution in component-based FOSS sys- 
tems. Science of Computer Programming  76(12), 1144-1160 (2011). 
https: //doi.org/10.1016/j.scico.2010.11.001 

Gardner, P., Ntzik, G., Wright, A.: Local reasoning for the POSIX file system. 
In: European Symposium On Programming. Lecture Notes in Computer Science, 
vol. 8410, pp. 169-188. Springer (2014). https: //doi.org/10.1007/978-3-642-54833- 
8 10 

Greenberg, M., Blatt, A.J.: Executable formal semantics for the POSIX shell. 
CoRR abs/1907.05308 (2019), http://arxiv.org/abs/1907.05308 

IEEE, The Open Group: The open group base specifications issue 7. http://pubs. 
opengroup.org/onlinepubs/9699919799/ (2018) 

Jeannerod, N., Marché, C., Treinen, R.: A Formally Verified Interpreter for a Shell- 
like Programming Language. In: 9th Working Conference on Verified Software: 
Theories, Tools, and Experiments. Lecture Notes in Computer Science, vol. 10712 
(2017), https: //hal.archives-ouvertes. fr /hal-01534747 

Jeannerod, N., Régis-Gianas, Y., Marché, C., Sighireanu, M., Treinen, R.: Speci- 
fication of UNIX utilities. Technical report, HAL Archives Ouvertes (Oct 2019), 
https:/ /hal.inria.fr/hal-02321691 

Jeannerod, N., Régis-Gianas, Y., Treinen, R.: Having fun with 31.521 shell 
scripts. Tech. rep., HAL Archives Ouvertes (2017), https:/ /hal.archives-ouvertes. 
fr/hal-01513750 

Jeannerod, N., Treinen, R.: Deciding the first-order theory of an algebra of 
feature trees with updates. In: Galmiche, D., Schulz, S., Sebastiani, R. (eds.) 
9th International Joint Conference on Automated Reasoning. Lecture Notes in 
Computer Science, vol. 10900, pp. 439-454. Springer, Oxford, UK (Jul 2018), 
https:/ /hal.archives-ouvertes.fr/hal-01807474 

Mazurak, K., Zdancewic, S.: ABASH: finding bugs in bash scripts. In: Workshop 
on Programming Languages and Analysis for Security. pp. 105-114 (2007) 

Ntzik, G., Gardner, P.: Reasoning about the POSIX file system: local update and 
global pathnames. In: Object-Oriented Programming, Systems, Languages and
Applications. pp. 201-220. ACM (2015). https://doi.org/10.1145/2814270.2814306

Ntzik, G., da Rocha Pinto, P., Sutherland, J., Gardner, P.: A concurrent
specification of POSIX file systems. In: European Conference on Object-Oriented
Programming. LIPIcs, vol. 109, pp. 4:1-4:28. Schloss Dagstuhl - Leibniz-Zentrum fuer
Informatik (2018). https://doi.org/10.4230/LIPIcs.ECOOP.2018.4

Régis-Gianas, Y., Jeannerod, N., Treinen, R.: Morbig: A static parser for POSIX 
shell. In: Pearce, D., Mayerhofer, T., Steimann, F. (eds.) ACM SIGPLAN In- 
ternational Conference on Software Language Engineering. pp. 29-41. Boston, 
MA, USA (Nov 2018). https://doi.org/10.1145/3276604.3276615,
https://hal.archives-ouvertes.fr/hal-01890044

Rosenfeld, R.: Package rancid-cgi: looking glass cgi based on rancid tools (2019), 
https://packages.debian.org/en/sid/rancid-cgi

Smolka, G.: Feature constraint logics for unification grammars. Journal of Logic 
Programming 12, 51-87 (1992) 

Smolka, G., Treinen, R.: Records for logic programming. Journal of Logic
Programming 18(3), 229-258 (Apr 1994)

The CoLiS project: The CoLiS bench.
http://ginette.informatique.univ-paris-diderot.fr/~niols/colis-batch/

The CoLiS project: The CoLiS toolchain. https://github.com/colis-anr




The CoLiS project: Artifact for Analysing installation scenarios of Debian
Packages. Zenodo Repository (Feb 2020). https://doi.org/10.5281/zenodo.3678390

The Debian Project: Bugs tagged colis,
https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=colis-shparser;users=treinen@debian.org

The Linux Foundation: Filesystem hierarchy standard, version 3.0 (Mar 2015), 
https://refspecs.linuxfoundation.org

Ucko, A.M.: cmigrep: broken emacsen-install script. Debian Bug Report 431131 
(Jun 2007), https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=431131


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.

The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.




Endicheck: Dynamic Analysis for Detecting
Endianness Bugs


Roman Kápl and Pavel Parízek


Department of Distributed and Dependable Systems,
Faculty of Mathematics and Physics, Charles University,
Prague, Czech Republic


TACAS Artifact Evaluation 2020: Accepted


Abstract. Computers store numbers in two mutually incompatible ways: little-endian
or big-endian. They differ in the order of bytes within the representation of
numbers. This ordering is called endianness. When two computer systems, programs
or devices communicate, they must agree on which endianness to use, in
order to avoid misinterpretation of numeric data values.

We present Endicheck, a dynamic analysis tool for detecting endianness bugs,
which is based on the popular Valgrind framework. It helps developers to find
those code locations in their program where they forgot to swap bytes properly.
Endicheck requires fewer source code annotations than existing tools, such
as Sparse used by Linux kernel developers, and it can also detect potential bugs
that would only manifest if the given program was run on a computer with the
opposite endianness. Our approach has been evaluated and validated on the Radeon SI
Linux OpenGL driver, which is known to contain endianness-related bugs, and on
several open-source programs. Results of experiments show that Endicheck can
successfully identify many endianness-related bugs and provide useful diagnostic
messages together with the source code locations of respective bugs.


1 Introduction 


Modern computers represent and store numbers in two mutually incompatible ways:
little-endian (with the least-significant byte first) or big-endian (the most-significant
byte first). The byte order is also referred to as endianness.
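
As a small illustration of the two byte orders (a sketch that is not part of Endicheck; all function names are ours), the following C code writes the 32-bit value 0x11223344 into a buffer in each order and probes the host's native order:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v with the least-significant byte first (little-endian). */
void store_le32(uint8_t out[4], uint32_t v) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Store v with the most-significant byte first (big-endian). */
void store_be32(uint8_t out[4], uint32_t v) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * (3 - i)));
}

/* Return 1 if the host keeps the least-significant byte at the lowest address. */
int host_is_little_endian(void) {
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Usage: the same value yields mirrored byte sequences. */
void byte_order_demo(void) {
    uint8_t le[4], be[4];
    store_le32(le, 0x11223344u);
    store_be32(be, 0x11223344u);
    assert(le[0] == 0x44 && le[1] == 0x33 && le[2] == 0x22 && le[3] == 0x11);
    assert(be[0] == 0x11 && be[1] == 0x22 && be[2] == 0x33 && be[3] == 0x44);
}
```

Because the two store functions use shifts rather than memory copies, they produce the same byte sequence on any host.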

Processor architectures typically define a native endianness, in which the proces- 
sor stores all data. When two computer systems or programs exchange data (e.g., via 
a network), they must first agree on which endianness to use, in order to avoid mis- 
interpretation of numeric data values. Also devices connected to computers may have 
control interfaces with endianness different from the host's native endianness. 

Therefore, programs communicating with other computers and devices need to 
swap the bytes inside all numerical values to the correct endianness. We use the term 
target endianness to identify the endianness a program should use for data exchanged 
with a particular external entity. Note that in some cases it is not necessary to know 
whether the target endianness is actually little-endian or big-endian. When the knowl- 
edge is important within the given context, we use the term concrete endianness. 

If the developer forgets to transform data into the correct target endianness, the bug
can often go unnoticed for a long time because software is nowadays usually developed
and tested on the little-endian x86 or ARM processor architecture. For example,
if two identical programs running on a little-endian architecture communicate over the
network using a big-endian protocol, a missing byte-order transformation in the same
place in code will not be observed. Our work on this project was motivated in the first
place by a concrete manifestation of this general issue. The Linux OpenGL driver for
Radeon SI graphics cards (the Mesa 17.4 version) does not work on big-endian computers
due to an endianness-related bug, as the first author discovered when he was working
on an industrial project that involved PowerPC computers in which Radeon graphics
cards were to be deployed.
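
The masking effect can be reproduced with a toy model (a sketch with hypothetical function names; the 4-byte buffer stands in for the network): both peers omit the conversion to the big-endian wire format, so they agree with each other, while a correct peer decodes a different number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Buggy sender: copies the integer in native byte order and forgets the
   conversion to the big-endian wire format. */
void buggy_send(uint8_t wire[4], uint32_t value) {
    memcpy(wire, &value, 4);
}

/* Buggy receiver with the same omission. */
uint32_t buggy_recv(const uint8_t wire[4]) {
    uint32_t value;
    memcpy(&value, wire, 4);
    return value;
}

/* Correct receiver: decodes big-endian bytes on any host. */
uint32_t correct_recv(const uint8_t wire[4]) {
    return ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
           ((uint32_t)wire[2] << 8)  |  (uint32_t)wire[3];
}

void masked_bug_demo(void) {
    uint8_t wire[4];
    buggy_send(wire, 0x01020304u);
    /* Two buggy peers on the same architecture agree with each other... */
    assert(buggy_recv(wire) == 0x01020304u);
    /* ...but on a little-endian host a correct peer sees another number. */
    if (wire[0] == 0x04)
        assert(correct_recv(wire) == 0x04030201u);
}
```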

We are aware of a few approaches to detection of endianness bugs, which are based on
static analysis and manually written source code annotations. An example is Sparse [11],
a static analysis tool used by Linux kernel developers to identify code locations where
byte-swaps are missing. The analysis performed by Sparse works basically in the same
way as type checking for C programs, and relies on the usage of specialized bitwise data
types, such as __le16 and __be32, for all variables with non-native endianness. Integers
with different concrete endianness are considered by Sparse as having mutually
incompatible types, and the specialized types are also not compatible with regular C integer
types. In addition, macros like __le32_to_cpu are provided to enable safe conversion
between values of the bitwise integer types and integer values of regular types. Such
macros are specially annotated so that the analysis can recognize them, and developers
are expected to use only those macros.
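
The bitwise-type idea can be sketched in a few lines of C. This is a minimal sketch modelled on the kernel convention, not the kernel's actual headers: the names ex_le32, ex_le32_to_cpu, EX_BITWISE and EX_FORCE are hypothetical, and host endianness detection assumes a GCC-compatible compiler (a little-endian host is assumed otherwise). Sparse defines __CHECKER__ and understands the bitwise and force attributes; an ordinary compiler sees plain integers.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __CHECKER__                      /* defined when Sparse runs */
#define EX_BITWISE __attribute__((bitwise))
#define EX_FORCE   __attribute__((force))
#else
#define EX_BITWISE
#define EX_FORCE
#endif

/* A little-endian 32-bit integer that Sparse treats as a distinct type. */
typedef uint32_t EX_BITWISE ex_le32;

uint32_t ex_bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* The only sanctioned way from the bitwise type to a native integer. */
uint32_t ex_le32_to_cpu(ex_le32 v) {
    uint32_t raw = (EX_FORCE uint32_t)v;
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    raw = ex_bswap32(raw);
#endif
    return raw;
}

/* The only sanctioned way from a native integer to the bitwise type. */
ex_le32 ex_cpu_to_le32(uint32_t v) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = ex_bswap32(v);
#endif
    return (EX_FORCE ex_le32)v;
}

void bitwise_demo(void) {
    ex_le32 on_wire = ex_cpu_to_le32(0x12345678u);
    assert(ex_le32_to_cpu(on_wire) == 0x12345678u);
    assert(ex_bswap32(0x11223344u) == 0x44332211u);
}
```

Under Sparse, assigning a plain uint32_t to an ex_le32 variable without going through the conversion function would be flagged; this is the manual typing effort that the dynamic approach aims to reduce.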

The biggest advantage of bitwise types is that a developer cannot assign a regular 
native endianness integer value to a variable of a bitwise type, or vice versa. Their 
nature also prevents the developer from using them in arithmetic operations, which do 
not work correctly on values with non-native byte order. On the other hand, a significant 
limitation of Sparse is that developers have to properly define the bitwise types for all 
data where endianness matters, and in particular to enable identification of data with 
concrete endianness — Sparse would produce imprecise results otherwise. Substantial 
manual effort is therefore required to create all the bitwise types and annotations. 

Our goals in this whole project were to explore an approach based on dynamic anal- 
ysis, and to reduce the amount of necessary annotations in the source code of a subject 
program. We present Endicheck, a dynamic analysis tool for detecting endianness bugs 
that is implemented as a plugin for the Valgrind framework [6]. The main purpose of 
the dynamic analysis performed by Endicheck is to track endianness of all data val- 
ues in the running subject program and report when any data leaving the program has 
the wrong endianness. The primary target domain consists of programs written in C or 
C++, and in which developers need to explicitly deal with endianness of data values. 

While the method for endianness tracking that we present is to a large degree
inspired by dynamic taint analyses (see, e.g., [8]), our initial experiments showed that
usage of existing taint analysis techniques and tools does not give good results,
especially with respect to precision. For example, an important limitation of the basic taint
analysis, when used for endianness checking, is that it would report false positives on
data that needs no byte-swapping, such as single byte-sized values. Therefore, we had to
modify and extend the existing taint analysis algorithms for the purpose of endianness
checking. During our work on Endicheck, we also had to solve many associated
technical challenges, especially regarding storage and propagation of metadata that contain
the endianness information; this includes, for example, precise tracking of single-byte
values.

Endicheck is meant to be used only during the development and testing phases of
the software lifecycle, mainly because it incurs a substantial runtime overhead that is
not acceptable for production deployment. Before Endicheck can be used, the
subject program needs to be modified, but only to inform the analysis engine where the
byte order is being swapped and where data values are leaving the program. In C and
C++ programs, byte-order swapping is typically done by macros provided in the system
C library, such as htons/htonl or those defined in the endian.h header file. Thus only
these macros need to be annotated. During the development of Endicheck, we redefined
each of those macros such that the custom variant calls the original macro and defines
necessary annotations; for examples, see Figure 1 in Section 4 and the customized
header file inet.h*. Similarly, data also tend to leave the program only through a few
procedures. For some programs, the appropriate place to check for correct endianness
is the send/write family of system calls.

Endicheck is released under the GPL license. Its source code is available at
https://github.com/rkapl/endicheck.

The rest of the paper is structured as follows. Section 2 begins with a more thorough
overview of the dynamic analysis used by Endicheck, and then it provides details about
the way endianness information for data values is stored and propagated. This
represents our main technical contribution, together with the evaluation of Endicheck on the
Radeon SI driver and several other real programs that is described in Section 5. Besides
that, we also provide some details about the implementation of Endicheck (Section 3)
together with a short user guide (Section 4).


2 Dynamic Analysis for Checking Endianness 


We have already mentioned that the dynamic analysis used by Endicheck to detect 
endianness bugs is a special variant of taint analysis, since it uses and adapts some 
related concepts. In the rest of this paper, we use the term endianness analysis. 


2.1 Algorithm Overview


Here we present a high-level overview of the key aspects of the endianness analysis.
Like common taint and data-flow analysis techniques (see, e.g., [8]), our dynamic
endianness analysis tracks the flow of data through program execution, together with
some metadata attached to specific data values. The analysis needs to attach metadata to
all memory locations for which endianness matters, and maintain them properly.
Metadata associated with a sequence of bytes (memory locations) that makes up a numeric
data value then capture its endianness. Similarly to many dynamic analyses, the metadata are
stored using a mechanism called shadow memory [9]. We give more details about
the shadow memory in Section 2.2.


* https://github.com/rkapl/endicheck/blob/master/endicheck/ec-overlay/arpa/inet.h

Although we mostly focus on checking that the program being analyzed does not
transmit data of incorrect endianness to other parties, there is also the opposite problem:
ensuring that the program does not use data of other than native endianness. For this
reason, our endianness analysis could also be used to check whether all operands of an
arithmetic instruction have the correct native endianness. This is important because
arithmetic operations are unlikely to produce correct results otherwise. Note, however,
that checking of native endianness for operands has not yet been implemented in the
Endicheck tool.

The basic principle behind the dynamic endianness analysis is to watch instructions
as they are being executed and check endianness at specific code locations, such as
the calls of I/O functions. We use the term I/O functions to identify all system calls
and other functions that encapsulate data exchange between a program and external
entities (e.g., writing or reading data to/from a hard disk, or network communication) in
a specific endianness. When the program execution reaches the call of an I/O function,
Endicheck checks whether all its arguments have the proper endianness. Note that the
user of Endicheck specifies the set of I/O functions by annotations (listed in Section 4).

In order to properly maintain the endianness information stored in the shadow mem- 
ory, our analysis needs to track almost every instruction being executed during the run 
of a subject program. The analysis receives notifications about relevant events from the 
Valgrind dynamic analysis engine. All the necessary code for tracking individual in- 
structions (processing the corresponding events), updating endianness metadata (inside 
the shadow memory), and checking endianness at the call sites of I/O functions, is added 
to the subject program through dynamic binary instrumentation. Further technical
details about the integration of Endicheck into Valgrind are provided later in Section 3.

Two distinguishing aspects of the endianness analysis — the format of metadata 
stored in the shadow memory and the way metadata are propagated during the analysis 
of program execution — are described in the following subsections. 


2.2 Shadow Memory 


A very important requirement on the organization and structure of shadow memory was 
full transparency for any C/C++ or machine code program. The original layout of heap 
and stack has to be preserved during the analysis run, since Endicheck (and Valgrind 
in general) targets C and C++ programs that typically rely on the precise layout of data 
structures in memory. Consequently, Endicheck cannot allocate the space for shadow 
memory (metadata) within the data structures of the analyzed program. 

When designing the endianness analysis, we decided to use the mechanism
supported by Valgrind [7], which allows client analyses to store a tag for each byte in the
virtual address space of the analyzed program without changing its memory layout. This
mechanism keeps a translation table (similar to page tables used by operating systems)
that maps memory pages to shadow pages where the metadata are stored.
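
A translation table of this kind can be sketched as follows. This is a simplified model under our own assumptions (32-bit addresses, 64 KiB pages, one tag byte per program byte, hypothetical names); it is not Valgrind's actual data structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS  16u                      /* 64 KiB pages (an assumption) */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define TABLE_SIZE (1u << 16)               /* covers a 32-bit address space */

enum { TAG_DEFAULT = 0 };                   /* tag of never-touched memory */

/* Translation table: page number -> shadow page (NULL = not allocated). */
static uint8_t *shadow_table[TABLE_SIZE];

/* Set the tag of one program byte; shadow pages are allocated lazily. */
void shadow_set(uint32_t addr, uint8_t tag) {
    uint8_t **slot = &shadow_table[addr >> PAGE_BITS];
    if (*slot == NULL) {
        *slot = calloc(PAGE_SIZE, 1);       /* every byte starts as TAG_DEFAULT */
        if (*slot == NULL)
            abort();
    }
    (*slot)[addr & (PAGE_SIZE - 1)] = tag;
}

/* Read the tag of one program byte without touching the program's memory. */
uint8_t shadow_get(uint32_t addr) {
    const uint8_t *page = shadow_table[addr >> PAGE_BITS];
    return page ? page[addr & (PAGE_SIZE - 1)] : TAG_DEFAULT;
}

void shadow_demo(void) {
    assert(shadow_get(0x12345678u) == TAG_DEFAULT);
    shadow_set(0x12345678u, 3);
    assert(shadow_get(0x12345678u) == 3);
    assert(shadow_get(0x12345679u) == TAG_DEFAULT);   /* neighbour untouched */
}
```

The tags live entirely outside the analyzed program's data structures, which is what preserves the original heap and stack layout.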

The naive approach would be to follow the same principles as taint analyses, i.e.,
reuse the idea of taint bits, and mark each byte of memory as being either of native
endianness or target endianness. However, our endianness analysis actually uses a richer
format of metadata and individual tags, which improves the analysis precision.

Rich Metadata Format. In this format of metadata, each byte of memory and each
processor register is annotated with one of the following tags that represent available
knowledge about the endianness of stored data values.


— native: The default endianness produced, for example, by arithmetic operations.
— target: Used for data produced by an annotated byte-swapping function.
— byte-sized: Marks the first byte of a multi-byte value (e.g., an integer or float).
— unknown: Endianness of uninitialized data (e.g., newly allocated memory blocks).


In addition to these four tags, each byte of memory can also be annotated with the 
empty flag, indicating that the byte's value is zero. Now we give more details about the 
meaning of these tags, and discuss some of the associated challenges. 
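
One possible in-memory encoding of this metadata is sketched below. The bit layout and all names are our assumptions for illustration only; the text does not specify how Endicheck packs the tags.

```c
#include <assert.h>
#include <stdint.h>

/* The four endianness tags described above. */
typedef enum { TAG_NATIVE, TAG_TARGET, TAG_BYTE_SIZED, TAG_UNKNOWN } ex_tag;

/* Hypothetical packing: bits 0-1 hold the tag, bit 2 holds the empty flag. */
#define EX_TAG_MASK   0x03u
#define EX_FLAG_EMPTY 0x04u

uint8_t meta_make(ex_tag tag, int empty) {
    return (uint8_t)((unsigned)tag | (empty ? EX_FLAG_EMPTY : 0u));
}

ex_tag meta_tag(uint8_t meta) {
    return (ex_tag)(meta & EX_TAG_MASK);
}

int meta_is_empty(uint8_t meta) {
    return (meta & EX_FLAG_EMPTY) != 0;
}

void meta_demo(void) {
    uint8_t m = meta_make(TAG_TARGET, 1);
    assert(meta_tag(m) == TAG_TARGET);
    assert(meta_is_empty(m));
    assert(!meta_is_empty(meta_make(TAG_UNKNOWN, 0)));
}
```

With such a packing, one shadow byte per program byte is enough to hold both the tag and the flag.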


Single-byte values. Our approach to precise handling of single-byte values is moti- 
vated by the way arithmetic operations are processed. Determining the correct size of 
the result of an arithmetic operation (in terms of the number of actually used bytes) 
is difficult in practice, because compilers often choose to use instructions that operate 
on wider types than actually specified by the developer in program source code. This 
means the analysis cannot, in some cases, precisely determine whether the result of an 
arithmetic operation has only a single byte. Our solution is to always mark the least- 
significant byte of the result with the tag byte-sized. Such an approach guarantees that if 
only the least-significant byte of an integer value is actually used, it does not trigger any 
endianness errors when checked, because the respective memory location is not tagged 
as native. On the other hand, if the whole integer value is really used (or at least more 
than just the least-significant byte), there is one byte marked with the tag byte-sized 
and the rest of the bytes are marked as native, thus causing an endianness error when 
checked. 
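
The effect of this rule can be sketched in a simplified model (names are hypothetical; index 0 denotes the least-significant byte of the result, and only the native tag is treated as an error, which corresponds to the default handling of unknown described in the text):

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TAG_NATIVE, TAG_TARGET, TAG_BYTE_SIZED, TAG_UNKNOWN } ex_tag;

/* Tag the result of an arithmetic operation of the given width:
   the least-significant byte is byte-sized, all other bytes are native. */
void tag_arith_result(ex_tag tags[], size_t width) {
    assert(width >= 1);
    for (size_t i = 0; i < width; i++)
        tags[i] = (i == 0) ? TAG_BYTE_SIZED : TAG_NATIVE;
}

/* Check performed at an I/O function: count the bytes that still carry
   the native tag, i.e. bytes that were never byte-swapped. */
size_t count_native(const ex_tag tags[], size_t n) {
    size_t errors = 0;
    for (size_t i = 0; i < n; i++)
        if (tags[i] == TAG_NATIVE)
            errors++;
    return errors;
}

void single_byte_demo(void) {
    ex_tag result[4];
    tag_arith_result(result, 4);
    assert(count_native(result, 1) == 0);   /* only the low byte is sent: fine */
    assert(count_native(result, 4) == 3);   /* whole integer is sent: error */
}
```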


Empty byte flag. Usage of the empty flag helps to improve performance of the en- 
dianness analysis when processing byte-shuffling instructions, because all operations 
with empty flags are simpler than operations with the actual values. However, this flag 
can be soundly used only when the operands are byte-wise disjoint, i.e. when each byte 
is zero (empty) in at least one of the operands. Arithmetic operations are handled in a 
simplified way — they never mark bytes as empty in the result. Consequently, while the 
empty tag implies that the given byte is zero, the reverse implication does not hold. 
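
To illustrate why byte-wise disjoint operands matter, here is one way metadata could be combined for a bitwise OR. This is our own sketch of the idea, not the rule implemented in Endicheck, and the struct and function names are hypothetical: when one operand byte is known to be zero, the result byte is simply the other operand's byte, so its metadata can be passed through.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t tag;     /* 0 = native, 1 = target, 2 = byte-sized, 3 = unknown */
    uint8_t empty;   /* 1 only if the byte is known to be zero */
} ex_meta;

/* Metadata for one byte of the result of (a | b). */
ex_meta or_meta(ex_meta a, ex_meta b) {
    if (a.empty)
        return b;                    /* 0 | b == b, so b's metadata applies */
    if (b.empty)
        return a;                    /* a | 0 == a */
    ex_meta mixed = { 0, 0 };        /* overlapping data: fall back to native */
    return mixed;
}

void or_meta_demo(void) {
    ex_meta zero   = { 0, 1 };
    ex_meta target = { 1, 0 };
    ex_meta native = { 0, 0 };
    assert(or_meta(zero, target).tag == 1);      /* tag passes through */
    assert(or_meta(target, zero).tag == 1);
    assert(or_meta(target, native).tag == 0);    /* not disjoint */
    assert(or_meta(zero, zero).empty == 1);
}
```

This covers hand-written byte swaps built from shifts, masks and ORs, where each byte of the result comes from exactly one operand.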


Unknown tag. We introduced the tag unknown in order to better handle data values
for which the analysis cannot say whether they are already byte-swapped. Endicheck
uses this tag especially for uninitialized data. Values marked with the tag unknown are
not reported as erroneous by default, but this behavior is configurable. We discuss other
related problems, concerning especially precision, below in Section 2.4.


2.3 Propagation of Metadata 


An important aspect of the endianness analysis is that data values produced by the
subject program are marked as having the native endianness by default. This behavior
matches the prevailing case, because data produced by most instructions (e.g., by
arithmetic operations) and constant values can be assumed to have native endianness.

In general, metadata are propagated upon execution of an instruction according to 
the following policy: 


— Arithmetic operations always produce native-endianness result values. 
— Data manipulation operations (e.g., load and store) propagate tags from their operands 
to results without any changes. 


Endicheck also correctly passes metadata through routines such as memcpy and certain
byte-shuffling operations (e.g., the shift operators << and >>). Complete details for all
categories of instructions and routines are provided in the master's thesis of the first author [3].

The only way to create data with the target tag is via explicit annotation from the 
user. Specifically, the user needs to add annotations to byte-swapping functions in order 
to set the target tag on return values. 
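
The propagation policy can be summarized in a toy model of a tagged 32-bit value. This is a sketch with hypothetical names, not Endicheck's instrumentation code; tag[0] belongs to the least-significant byte, which simplifies real shadow memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TAG_NATIVE, TAG_TARGET, TAG_BYTE_SIZED, TAG_UNKNOWN };

typedef struct {
    uint32_t bits;
    uint8_t  tag[4];
} tagged32;

/* Arithmetic: the result is native whatever the operand tags were; the
   lowest byte gets byte-sized, as described for single-byte values. */
tagged32 t_add(tagged32 a, tagged32 b) {
    tagged32 r = { a.bits + b.bits,
                   { TAG_BYTE_SIZED, TAG_NATIVE, TAG_NATIVE, TAG_NATIVE } };
    return r;
}

/* Data movement (load, store, memcpy): tags travel unchanged. */
tagged32 t_copy(tagged32 src) {
    return src;
}

/* Annotated byte-swapping function: the only producer of the target tag. */
tagged32 t_swap_annotated(tagged32 v) {
    tagged32 r;
    r.bits = (v.bits >> 24) | ((v.bits >> 8) & 0x0000FF00u) |
             ((v.bits << 8) & 0x00FF0000u) | (v.bits << 24);
    memset(r.tag, TAG_TARGET, sizeof r.tag);
    return r;
}

void propagation_demo(void) {
    tagged32 one = { 1, { TAG_UNKNOWN, TAG_UNKNOWN, TAG_UNKNOWN, TAG_UNKNOWN } };
    tagged32 two = { 2, { TAG_TARGET, TAG_TARGET, TAG_TARGET, TAG_TARGET } };
    tagged32 sum = t_add(one, two);
    assert(sum.bits == 3 && sum.tag[0] == TAG_BYTE_SIZED && sum.tag[3] == TAG_NATIVE);
    tagged32 swapped = t_swap_annotated(sum);
    assert(swapped.bits == 0x03000000u && swapped.tag[0] == TAG_TARGET);
    assert(t_copy(swapped).tag[2] == TAG_TARGET);
}
```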


2.4 Discussion: Analysis Design and Precision 


The basic scenario that is obviously supported by our analysis is the detection of
endianness bugs when the target and native endianness are different. However, the design of
our analysis ensures that it can be useful even in cases when the native endianness is the
same as the target endianness. Although byte-swapping functions then become identities,
the endianness analysis can still find data that would not be byte-swapped if the
two endiannesses were different: it does this by setting the respective tags when data pass
through the byte-swapping functions. In addition, the endianness analysis can also be
used to detect the opposite direction of errors, namely programs using non-native
endianness data values (e.g., received as input) without byte-swapping them first.

Endicheck does not handle constants and immediate values in instructions very well, 
since the analysis cannot automatically recognize their endianness and therefore cannot 
determine whether the data need byte-swapping or not. Constants stored in the data 
section of a binary executable represent the main practical problem to the analysis, 
because the data section does not have any structure — it is just a stream of bytes. Our 
solution is to mark data sections initially with the tag unknown. If this is not sufficient, 
a user must annotate the constants in the program source code to indicate whether they 
already have the correct endianness. 

A possible source of false bug reports are unused bytes within a block of memory
that has undefined content, unless the memory was cleared with zeros right after its
allocation. This may occur, for example, when some fields inside C structures have specific
alignment requirements. Some space between individual fields inside the structure
layout is then unused, and marked either with the tag unknown or with the tag left over
from the previous content of the memory block.
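
The following sketch shows where such unused bytes come from and the usual remedy of clearing the whole structure first. The struct and function names are hypothetical; the exact amount of padding depends on the ABI, and strictly speaking the C standard leaves padding contents unspecified after a member store, although they stay zero in practice on common compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On common ABIs the 32-bit field is 4-byte aligned, which leaves
   unused padding bytes between 'kind' and 'length'. */
struct header {
    uint8_t  kind;
    uint32_t length;
};

/* Number of padding bytes between the two fields. */
size_t header_padding(void) {
    return offsetof(struct header, length) - sizeof(uint8_t);
}

/* Clear the whole structure first so that the padding has defined (zero)
   content, then fill in the fields. */
void header_init(struct header *h, uint8_t kind, uint32_t length) {
    memset(h, 0, sizeof *h);
    h->kind = kind;
    h->length = length;
}

void padding_demo(void) {
    struct header h;
    unsigned char raw[sizeof h];
    header_init(&h, 7, 0x01020304u);
    memcpy(raw, &h, sizeof h);
    for (size_t i = 1; i < offsetof(struct header, length); i++)
        assert(raw[i] == 0);              /* padding bytes were cleared */
}
```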


3 Implementation 


We distribute the Endicheck tool in the form of an open source software package that
was initially created as a fork of the Valgrind source code repository. Although tools
and plugins for Valgrind can be maintained as separate projects, forking allowed us
to make changes to the Valgrind core and use its build/test infrastructure. Within the
whole source tree of Endicheck, which includes the forked Valgrind codebase, the code
specific to Endicheck is located in the endicheck directory. It consists of these modules:


— ec_main: tool initialization, command-line handling and routines for translation
to/from intermediate representation;

— ec_errors: error reporting, formatting and deduplication;

— ec_shadow: management of the shadow memory, storing of the endianness metadata,
protection status and origin tracking information (see below);

— ec_util: utility functions for general use and for manipulation with the metadata;

— endicheck.h: public API with annotations to be used in programs by developers.


In the rest of this section, we briefly describe how Endicheck uses the Valgrind 
infrastructure and a few other important features. Additional technical details about the 
implementation are provided in the master thesis of the first author [3]. 


Usage of Valgrind infrastructure. Endicheck depends on the Valgrind core (i) for
dynamic just-in-time instrumentation [6] of a target binary program and (ii) for the actual
dynamic analysis of program execution. The subject binary program is instrumented
with code that carries out all the tasks required by our endianness analysis, especially
recording of important events and tracking information about the endianness of data
values. When implementing the Endicheck plugin, we only had to provide code doing the
instrumentation itself and define what code has to be injected at certain locations in the
subject program. Note also that for the analysis to work correctly and provide accurate
results, Valgrind instruments all components of the subject program that may possibly
handle byte-swapped data, including application code, the system C library and other
libraries. During the analysis run, Valgrind notifies the Endicheck plugin about
execution of relevant instructions and Endicheck updates the information about endianness
of affected data values accordingly. Besides instrumentation and the actual dynamic
analysis, other features and mechanisms of the Valgrind framework used by Endicheck
include: utility functions, origin tracking, and developer-friendly error reporting.

Origin tracking [1] is a mechanism that can help users in debugging the endianness 
issues. An error report contains two stack traces: one identifies the source code loca- 
tion of the call to the I/O function where the wrong endianness of some data value was 
detected, and the second trace, provided by origin tracking, identifies the source code 
location where the value has originated. In Endicheck, the origin information (identi- 
fier of the stack trace and execution context) is stored alongside the other metadata in 
the shadow memory for all values. We decided to use this approach because almost 
all values need origin tracking, since they can be sources of errors — in contrast to 
Memcheck, where only the uninitialized values can be sources of errors. 

During our experiments with the Radeon SI OpenGL driver (described in
Section 5.1), we have noticed that the driver maps the device memory into the user-space
process. In that case, there is no single obvious point where to check the endianness
of data that leave the program through the mapped memory. To solve this problem and
support memory-mapped I/O, we extended our analysis to automatically check
endianness at all writes to regions of the mapped device memory. We implemented this feature
in such a way that each byte of a device memory region is tagged with a special flag
protected; Endicheck can then find very quickly whether some region of memory
is mapped to a device or not. Note that the flag is associated with a memory location,
while the endianness tags (described in Section 2.2) are associated with data values.
Therefore, the special flag is not copied, e.g. when execution of memcpy is analyzed; it
can be only set explicitly by the user.
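
The distinction between a flag that belongs to a location and a tag that belongs to a value can be sketched with a small memory model. All names and the 64-byte toy address space are our assumptions; this is not Endicheck's shadow-memory code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE 64

enum { TAG_NATIVE, TAG_TARGET, TAG_BYTE_SIZED, TAG_UNKNOWN };

static uint8_t tag_of[MEM_SIZE];        /* endianness tag: follows the data */
static uint8_t is_protected[MEM_SIZE];  /* protected flag: stays with the location */

/* Set explicitly by the user for a mapped device region. */
void protect_region(size_t start, size_t size) {
    assert(start + size <= MEM_SIZE);
    for (size_t i = 0; i < size; i++)
        is_protected[start + i] = 1;
}

/* Model of storing one tagged byte; returns 1 if an error would be
   reported (native data written into protected memory). */
int store_byte(size_t addr, uint8_t tag) {
    assert(addr < MEM_SIZE);
    tag_of[addr] = tag;
    return is_protected[addr] && tag == TAG_NATIVE;
}

/* Model of memcpy: tags are copied, the protected flag is not. */
int copy_bytes(size_t dst, size_t src, size_t n) {
    int errors = 0;
    for (size_t i = 0; i < n; i++)
        errors += store_byte(dst + i, tag_of[src + i]);
    return errors;
}

void protected_demo(void) {
    protect_region(32, 4);
    for (size_t i = 0; i < 4; i++) {
        assert(store_byte(i, TAG_NATIVE) == 0);      /* ordinary memory */
        assert(store_byte(8 + i, TAG_TARGET) == 0);
    }
    assert(copy_bytes(32, 0, 4) == 4);   /* native data reaches the device */
    assert(copy_bytes(32, 8, 4) == 0);   /* swapped data is accepted */
    copy_bytes(16, 32, 4);
    assert(is_protected[16] == 0);       /* the flag did not travel */
}
```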


4 User Guide 


The recommended way to install Endicheck is building from the source code. Instruc- 
tions are provided in the README file at the project web site. When Endicheck has 
been installed, a user can run it by executing the following command: 


valgrind --tool=endicheck [OPTIONS...] PROGRAM ARGS...


Origin tracking is enabled by the option --track-origins=yes.


Annotations. In order to analyze a given program, some annotations typically must be
added into the program source code. A user of Endicheck has to mark the byte-swapping 
functions and the I/O functions (through which data values are leaving the program), 
because these functions cannot be reliably detected in an automated way. 

The specific annotations are defined in the C header file endicheck.h. Here follows 
the list of supported annotations, together with explanation of their meaning: 


— EC_MARK_ENDIANITY(start, size, endianness)
This annotation marks a region of memory from start to start+size-1 as having the
given endianness. It should be used in byte-swapping functions. Target endianness
is represented by the symbol EC_TARGET.

— EC_CHECK_ENDIANITY(start, size, msg)
This annotation enforces a check that a memory region from start to start+size-1
contains only data with any or target endianness. It should be used in I/O functions.
Unknown endianness is allowed by passing the --allow-unknown option.

— EC_PROTECT_REGION(start, size)
Marks the given region of memory as protected. This should be used for mapped
regions of device memory.

— EC_UNPROTECT_REGION(start, size)
Marks the given memory region as unprotected.

— EC_DUMP_MEM(start, size)
Dumps endianness of a memory region. This is useful for debugging.
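
As a hedged sketch of how the region annotations might be placed around a mapped device buffer: the helper functions and the HAVE_ENDICHECK build flag below are hypothetical, and when the flag is not set the annotations are replaced by no-op stand-ins so that the example compiles without Endicheck installed. Only the macro names and argument lists are taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef HAVE_ENDICHECK                    /* hypothetical build flag */
#include <valgrind/endicheck.h>
#else
/* No-op stand-ins, used only when Endicheck is not available. */
#define EC_PROTECT_REGION(start, size)   ((void)(start), (void)(size))
#define EC_UNPROTECT_REGION(start, size) ((void)(start), (void)(size))
#define EC_DUMP_MEM(start, size)         ((void)(start), (void)(size))
#endif

/* Hypothetical hook: call right after a device buffer has been mapped. */
void on_device_map(void *base, size_t size) {
    EC_PROTECT_REGION(base, size);       /* later writes get checked */
}

/* Hypothetical hook: call right before the buffer is unmapped. */
void on_device_unmap(void *base, size_t size) {
    EC_UNPROTECT_REGION(base, size);
}

/* Hypothetical helper: stores a word that the caller has already
   converted to the device's byte order. */
void write_command(uint32_t *reg, uint32_t word_in_device_order) {
    *reg = word_in_device_order;
    EC_DUMP_MEM(reg, sizeof *reg);       /* debugging aid */
}

void region_demo(void) {
    uint32_t fake_device[2] = { 0, 0 };  /* stands in for mapped memory */
    on_device_map(fake_device, sizeof fake_device);
    write_command(&fake_device[1], 0xAABBCCDDu);
    assert(fake_device[1] == 0xAABBCCDDu);
    on_device_unmap(fake_device, sizeof fake_device);
}
```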


Figure 1 shows an example program that demonstrates usage of the most important
annotations (EC_MARK_ENDIANITY and EC_CHECK_ENDIANITY). If the call to htobe32 inside
main is removed, Endicheck will report an endianness bug. This example also demonstrates
possible ways to easily annotate standard functions, like htobe32 and write.

#include <valgrind/endicheck.h>

uint32_t htobe32(uint32_t x) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    x = bswap_32(x);
#endif
    EC_MARK_ENDIANITY(&x, sizeof(x), EC_TARGET);
    return x;
}

int ec_write(int file, const void *buf, size_t count) {
    EC_CHECK_ENDIANITY(buf, count, NULL);
    return write(file, buf, count);
}

#define write ec_write

int main() {
    uint32_t x = 0xDEADBEEF;
    x = htobe32(x);
    write(0, &x, sizeof(x));
    return 0;
}


Fig. 1. Small example program with Endicheck annotations. 


5 Evaluation 


We evaluated the Endicheck tool, namely its ability to find endianness bugs, its precision
and its overhead, by means of a case study on the Radeon SI driver, several
open-source programs and a standardized performance benchmark. For the Radeon SI driver
and each of the open-source programs, we provide a link to its source code repository
(and identification of the specific version that we used for our evaluation) within the
artifact that is referenced from the project web site.


5.1 Case Study 


Our case study is Radeon SI, the Linux OpenGL driver for Radeon graphics cards, start- 
ing with the SI (Southern Islands) line of cards and continuing to the current models. 

Since these Radeon cards are little-endian, the driver must byte-swap all data when 
running on a big-endian architecture such as PowerPC. However, the Radeon SI driver 
(in the Mesa 17.4 version) does not perform the necessary byte-swapping operations, 
and therefore simply does not work in the case of PowerPC — it crashes either the GPU 
or OpenGL programs using the driver. In particular, endianness bugs in this version of 
the Radeon SI driver cause the Glxgears demo on PowerPC to crash. We give more 
details about the bugs we have found in Section 5.2.

An important feature of the whole Linux OpenGL stack is that all layers, including
the user-space program, communicate not only using calls of library functions and
system calls, but they also extensively use mapping of the device memory directly into
the user process. Given such an environment, Endicheck has to correctly handle (1) the
flow of data through the whole OpenGL stack by instrumenting all the libraries used,
and (2) communication through the shared memory that is used by the driver. This is
why the support for mapped memory in Endicheck, through marking of device memory
with a special flag, as described above in Section 3, is essential.


5.2 Search for Bugs 


For the purpose of evaluating Endicheck's ability to find endianness bugs, we picked a 
diverse set of open-source programs (in addition to the Radeon SI driver), including the 
following: BusyBox, OpenTTD, X.Org and ImageMagick. All programs are listed in 
Table 1. The only criterion was to select programs written in C that communicate over
the network or store data in binary files, since only such programs may possibly contain 
endianness bugs. We also document our experience with fixing the endianness bugs in 
the Radeon SI driver and other programs. 

One of the stated goals for Endicheck was to reduce the number of annotations that 
a user must add into the program source code in order to enable search for endianness 
bugs. Therefore, below we report the relevant measurements and discuss whether (and 
to what degree) this goal has been achieved. 

In the rest of this section, first we discuss application of Endicheck on the Radeon 
SI driver (our case study) and then we present results for other programs. 


Radeon SI case study. Within our case study, we used the Glxgears demo program as a
test harness for the Radeon SI driver. Initially we ran Glxgears on the x86 architecture,
and after fixing all the issues found and reported by Endicheck, we moved the same
graphics card to a PowerPC host computer and continued testing there.

In the case of the Radeon SI driver, all byte-swapping functions are located in a
single file of one library (Gallium) on the OpenGL stack. Therefore, to enable search
for endianness bugs in Radeon SI, we had to make just two changes: (1) annotate the
function radeon_drm_cs_add_buffer as an I/O function and (2) annotate the byte-swapping
functions in Gallium. Overall, we had to add or change about 40 lines of source code,
including annotations, in a single place. All our changes are published in a repository
that contains the source code of Mesa augmented with our annotations and fixes for the
endianness-related bugs in Radeon SI described below. For fixes of bugs found by
Endicheck, we included the original Endicheck report in the commit message, under the
ECNOTE header.

Figure 2 contains an example bug report produced by Endicheck on Glxgears with origin
tracking enabled. The error report itself has three main parts (in this order): the
problem description, the origin stack trace (captured when the offending value is created)
and the point-of-check stack trace (recorded when some annotated I/O function is
encountered). We show only fragments of stack traces for illustration (and to save space).

The problem description identifies the currently active thread, the nature of the error
and the memory region containing the erroneous value. The memory region is identified
by its address and an optional name provided by the program ("radeon_add_buffer" in
this case). Metadata are printed just for the part of the memory region that contains data
with the wrong endianness, using this convention: N = native, U = undefined.


Thread 9 gallium_drv:0:
Memory does not contain data of Target endianness
Problem was found in block 0x41BF000 (named radeon_add_buffer)
  at offset 0, size 8:
0x41BF000: N N N N N N N N
The value was probably created at this point:
  at 0x8B787F7: si_init_msaa_functions (si_state_msaa.c:94)
  by 0x8B4F979: si_create_context (si_pipe.c:279)

  by 0x4C46661: glXCreateContext (glxcmds.c:427)
  by 0x10B67A: make_window.constprop.1 (glxgears.c:559)
  by 0x109A86: main (glxgears.c:777)
The endianness check was requested here:
  at 0x8B85C45: radeon_drm_cs_add_buffer (radeon_drm_cs.c:375)
  by 0x8B4A58B: si_set_constant_buffer (r600_cs.h:74)
  by 0x8B708D0: si_set_framebuffer_state (si_state.c:2934)

  by 0x55357FB: start_thread (pthread_create.c:465)
  by 0x5861B0E: clone (clone.S:95)


Fig. 2. Error report from Endicheck run on the Glxgears demo program

This particular error report (Figure 2) indicates that an array of floating-point values
describing the multisampling pattern is not byte-swapped. Note that IEEE 754 floating 
point values also obey the endianness of the host platform, at least on the architectures 
x86, x64 and ARM. To repair the corresponding bug, we had to insert calls of byte- 
swapping functions at the code location where the floating-point array is produced. 
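
To illustrate the kind of fix involved, the following minimal Python sketch (not code from Mesa or Endicheck) contrasts serializing an array of floats in host byte order with serializing it explicitly in the little-endian order that the device expects. The function names and the sample values are hypothetical and serve only as an illustration.

```python
import struct
import sys

def pack_floats_native(values):
    """Serialize floats in host byte order (wrong for the device on big-endian hosts)."""
    return struct.pack("=%df" % len(values), *values)

def pack_floats_for_device(values):
    """Serialize floats as little-endian IEEE 754 singles, whatever the host order is."""
    return struct.pack("<%df" % len(values), *values)

# Hypothetical sample positions; any non-trivial floats would do.
samples = [0.25, 0.75]
native = pack_floats_native(samples)
correct = pack_floats_for_device(samples)

# The two encodings coincide only when the host itself is little-endian.
print(sys.byteorder, native == correct)
```

On a little-endian host both functions produce the same bytes, which is why such a bug stays invisible on x86 unless a tool tracks the endianness of the data explicitly.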

During our experiments with Radeon SI and Glxgears, four endianness bugs in total
were detected by Endicheck on the x86 architecture before testing on PowerPC. After
we fixed the bugs, the Glxgears demo ran successfully. This shows that Endicheck
detected all bugs it was supposed to and provided reports useful enough so that the bugs
could be fixed. Here we also need to emphasize that the Glxgears demo, naturally, does
not exercise all code in the Radeon SI driver, and fixing the whole driver would require
a lot of additional work.


Other programs. As we said at the beginning of this section, we evaluated Endicheck's 
ability to find endianness bugs and precision on a set of realistic programs. Our primary 
goal in this part of the evaluation was to assess the following aspects: 


— the extent of annotations that is required for Endicheck to work properly, 
— whether Endicheck is able to detect a bug in a given kind of programs, and 
— how many false warnings are reported. 


Before trying to answer these questions, we wanted to be sure that the subject pro- 
grams contain endianness bugs. However, some of the programs that we considered 
(OpenTTD, OpenArena and ImageMagick) are written in such a way that realistic
endianness bugs cannot be injected into their codebase. ImageMagick uses a C++
abstraction layer for binary streams, which also handles endianness. OpenArena uses
bit-oriented encoding for most parts of the network communication. OpenTTD uses an
abstraction layer too, but the developer can still make an endianness-related mistake in
certain cases, such as storing an array of uint16_t values as an array of uint8_t values. We
manually injected synthetic endianness bugs into the code of all the programs where
this was possible. In this process, we also annotated the byte-swapping functions (like
htonl). The bugs were created by removing one usage of byte-swapping functions.

The results of experiments are summarized in Table 1. For each program, the table
provides the following information: whether it was possible to analyze the program at
all, whether some endianness bugs were found, overhead related to false warnings, and
how many lines of source code were added or changed in relation to Endicheck
annotations. Data for the Radeon SI driver are also included in the table for completeness.


Program          | Analyzable | Injected bug | False positives | Actual bugs | Annotations
Radeon SI driver | Yes        | Found        | Manageable (2)  | Found       | ca. 40 lines
BusyBox          | Yes        | Found        | No              | None found  | 20 lines
OpenTTD          | Partially  | Found        | Manageable (2)  | None found  | 59 lines
Ntpd             | Yes        | Found        | No              | None found  | 1 line
X.Org            | Yes        | Found        | No              | Found       | 30 lines
OpenArena        | No         |              |                 |             |
ImageMagick      | No         |              |                 |             |


Table 1. Search for bugs: precision and necessary annotations 


Data in Table 1 show that Endicheck could find the introduced bug in all the cases.
Furthermore, Endicheck found two genuine endianness-related bugs in X.Org. The bugs
were confirmed by the developers of X.Org and fixed.

Endicheck also reports some false warnings, but their numbers are not overwhelming.
Four cases in total occurred for the Radeon SI driver and OpenTTD (two in each).
This is a manageable amount, and these warnings can even be suppressed using further
annotations.


5.3 Performance 


In this section, we report on the performance of Endicheck in terms of the execution time
overhead it introduces. We compare the performance data for programs instrumented
with Endicheck, programs instrumented by the Memcheck plugin for Valgrind and programs
without any instrumentation. For the purpose of experiments, we used the standardized
benchmark SPEC CPU2000. Even though SPEC CPU2000 is a general benchmark, not tailored
for endianness analysis, results of experiments with this benchmark indicate the
performance of Endicheck when doing a real analysis, because the control-flow paths
exercised within Endicheck and the Valgrind core during an experiment do not depend on
the specific metadata (tag values).

We ran all experiments on a T550 ThinkPad notebook with 12 GiB of RAM and
an i5-5200 processor clocked at 2.20 GHz, under Arch Linux from Q2 2018. The
SPEC CPU2000 test harness was used for all the runs, with iteration count set to 3. We
compiled both Memcheck and Endicheck with GCC v7.3.0 using default options. Note
that we had to omit the benchmark program "gap", because it produced invalid results
when compiled with this version of GCC.

In the description of specific experiments, tables with results and their discussion, 
we use the following abbreviations: 


— EC: Endicheck (valgrind --tool=endicheck)
— MC: Memcheck (valgrind --tool=memcheck)
— -OT: with precise origin tracking enabled (--track-origins=yes)
— -IT: with origin tracking enabled, but not fully precise (--precise-origins=no)
— -P: with memory protection enabled (--protection=yes)


Execution time. We divided our experiments designed for measuring the execution
time into two groups. Our motivation was to ensure that all experiments, including the
EC-OT configuration that incurs a large overhead, finish within a reasonable time limit.
In the first group, we ran the full range of configurations on the "test" data set provided
by SPEC CPU2000, which is small compared to the full "reference" set, and used MC
as the baseline for comparisons. Table 2 shows results for experiments in this group. All
execution time data provided in this table are relative to MC, with the exception of the
absolute times (in seconds) given for the native and MC configurations. The second group
of experiments uses the full "reference" data set from SPEC CPU2000. Results for this
group are provided in Table 3. In this case, we used the data for native (uninstrumented)
programs as the baseline.


Program | Native (s) | MC (s) | MC-OT | EC    | EC-P  | EC-OT  | EC-IT
bzip2   |       1.38 |  19.40 | 2.27x | 2.07x | 2.23x | 33.87x | 12.58x
crafty  |       0.70 |  18.70 | 2.21x | 1.74x | 1.78x | 30.59x | 11.07x
eon     |       0.09 |   6.60 | 1.73x | 1.29x | 1.34x | 12.89x |  4.23x
gcc     |       0.31 |  12.70 | 1.96x | 1.92x | 1.98x | 24.17x |  9.53x
gzip    |       0.47 |   6.29 | 2.11x | 1.86x | 1.97x | 41.97x | 14.96x
mcf     |       0.05 |   0.85 | 2.38x | 1.27x | 1.32x | 11.88x |  7.08x
parser  |       0.66 |  10.50 | 2.19x | 2.13x | 2.28x | 41.24x | 16.29x
perlbmk |       4.31 |   5.52 | 1.10x | 0.95x | 0.95x |  1.17x |  1.05x
twolf   |       0.05 |   1.64 | 1.88x | 1.16x | 1.20x | 14.09x |  5.51x
vortex  |       1.06 |  56.90 | 2.23x | 1.95x | 2.04x | 28.38x |  9.86x
vpr     |       0.49 |   8.02 | 2.00x | 1.70x | 1.75x | 22.94x |  8.30x
G.mean  |       0.41 |   7.86 | 1.97x | 1.59x | 1.65x | 18.17x |  7.56x


Table 2. Execution times for the SPEC CPU2000 test data set, relative to Memcheck. 
Program | Native (s) | MC     | EC      | EC-P
bzip2   |       66.3 | 11.63x |  23.47x |  24.45x
crafty  |       29.5 | 26.78x |  48.10x |  48.54x
eon     |       24.1 | 52.12x |  93.36x |  97.34x
gcc     |       27.8 | 27.73x | 116.62x | 122.48x
gzip    |       79.9 |  8.92x |  15.93x |  16.80x
mcf     |      67.10 |  2.71x |   6.90x |   6.94x
parser  |       89.9 | 10.78x |  23.04x |  23.86x
perlbmk |       45.9 | 38.45x |  93.62x |  96.27x
twolf   |         93 | 12.43x |  19.77x |  19.52x
vortex  |       43.8 | 44.36x |  91.03x |  92.85x
vpr     |       54.7 | 10.49x |  20.29x |  20.68x
G.mean  |      51.29 | 16.59x |  35.31x |  36.25x


Table 3. Execution times for the SPEC CPU2000 reference data set, relative to native runs. 


Data in Table 3 indicate that the average slowdown of Memcheck is a factor
of 16.59. Endicheck, in comparison, slows down the analyzed program by a factor
of 35.31. This means Endicheck has roughly two times higher overhead than Memcheck
with default options. According to data in Table 2, the same relative slowdown
of Endicheck with respect to Memcheck is 1.65x. This difference between the results
for the reference and test data sets is caused by the different ratio of the time spent
instrumenting the code versus time spent running the instrumented code.
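
The averages quoted for the reference data set can be re-derived from the per-program slowdown factors. The short Python sketch below is our own illustration, not part of the evaluation scripts. It assumes that the "G.mean" row is a plain geometric mean of the rounded per-program values in the MC and EC columns, so small deviations in the last digit are to be expected.

```python
import math

# Slowdown factors relative to native runs, copied from the MC and EC columns
# of the table for the SPEC CPU2000 reference data set.
MC = [11.63, 26.78, 52.12, 27.73, 8.92, 2.71, 10.78, 38.45, 12.43, 44.36, 10.49]
EC = [23.47, 48.10, 93.36, 116.62, 15.93, 6.90, 23.04, 93.62, 19.77, 91.03, 20.29]

def geometric_mean(factors):
    """Geometric mean, computed through logarithms to avoid large products."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))

gm_mc = geometric_mean(MC)
gm_ec = geometric_mean(EC)
ratio = gm_ec / gm_mc  # overhead of Endicheck relative to Memcheck

print("MC %.2fx, EC %.2fx, EC/MC %.2f" % (gm_mc, gm_ec, ratio))
```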

However, data in both tables also show that the performance of Endicheck with origin
tracking is lacking compared to Memcheck with the same option. It was still usable
for our Radeon SI OpenGL tests, but measurements indicate that there is room for
optimization. Nevertheless, a certain relative slowdown between the configurations EC-OT
and MC-OT probably cannot be avoided, because Endicheck must track origin information
for much more data than Memcheck. Based on our experiments, we observed
that creating the origin information is the most expensive operation involved. When the
origin tags are created for each superblock, instead of every instruction, the execution
times drop roughly by a factor of two (see the columns EC-OT and EC-IT).


5.4 Discussion 


Based on the case study and results of experiments presented in the previous sections, 
we draw the following general conclusions:


— Endicheck can find true endianness bugs in large real programs, assuming that the 
user correctly annotates all the byte-swapping functions and I/O functions. 

— Using fairly complex metadata is feasible in terms of performance and encoding. 

— Performance of Endicheck is practical even on large programs, despite the overhead
and even though its current version is not yet optimized as well as Memcheck.

— Although Endicheck, due to precise dynamic analysis, requires fewer annotations to
be specified manually than static analysis-based tools (e.g., Sparse), it still puts
a certain burden on the user.
Regarding the annotation burden, we already mentioned that the user has to carefully
mark, in particular, all the I/O functions and byte-swapping functions, so that Endicheck
can correctly update endianness tags associated with memory locations during the run
of the analysis. It would be possible to recognize byte-swapping functions automatically,
e.g. by static code analysis, but the endianness analysis would then have to be
run on a machine whose native endianness differs from the target endianness, so
that actual byte-swaps will be present.
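
The role of the two kinds of annotations can be illustrated with a toy model in Python. This is only a sketch of the general idea of tag tracking, not the way Endicheck is implemented (Endicheck instruments binaries through Valgrind). All names below are hypothetical, and we assume a big-endian target for concreteness.

```python
import struct

NATIVE, TARGET = "N", "T"

class TaggedBytes:
    """Raw bytes plus one shadow endianness tag per byte (toy model)."""

    def __init__(self, data, tag):
        self.data = bytes(data)
        self.tags = [tag] * len(self.data)

    def __add__(self, other):
        # Concatenation keeps the tags of both operands.
        joined = TaggedBytes(self.data + other.data, NATIVE)
        joined.tags = self.tags + other.tags
        return joined

def native_u32(value):
    """A freshly computed value lives in host byte order."""
    return TaggedBytes(struct.pack("=I", value), NATIVE)

def to_target_u32(tagged):
    """Annotated byte-swapping function: converts the value and retags it."""
    (value,) = struct.unpack("=I", tagged.data)
    return TaggedBytes(struct.pack(">I", value), TARGET)

def checked_send(tagged):
    """Annotated I/O function: returns offsets of bytes not in target order."""
    return [i for i, tag in enumerate(tagged.tags) if tag != TARGET]

good = to_target_u32(native_u32(1))
bad = native_u32(2)               # the byte swap was forgotten here
message = good + bad
errors = checked_send(message)    # offsets of the unswapped bytes
print(errors)
```

Because the check relies on tags rather than on the byte values, the missing swap is reported even on a host whose native order happens to match the target.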

Another limitation of Endicheck from the practical perspective is handling of com- 
plex data transformations, a problem shared with taint analysis. The metadata cannot 
be correctly preserved through transformations such as encryption/decryption and com- 
pression/decompression. However, in many cases, the problem could be avoided by re- 
quiring an endianness check to be performed just before the respective transformation. 


6 Related Work 


As far as we know, the Sparse tool used by Linux kernel developers, which we
already mentioned, is the only publicly available specialized tool tackling the problem
of finding endianness bugs. The main advantage of Endicheck over Sparse is better
precision in some cases, i.e. fewer false bug reports, since dynamic analysis, which
observes actual program execution and runtime data values, is typically more precise than
static analysis. Endicheck also does not require as many annotations of functions and
variables as Sparse: when using Endicheck, typically just a few places in the program
source code need to be annotated manually. More specifically, Sparse expects that an
input program code involves (i) the specialized bitwise data types (e.g., __le32) for all
variables where endianness matters and (ii) the macros for conversion between regular
types and bitwise types (e.g., le32_to_cpu). With Endicheck, developers only have to
annotate the byte-swapping functions used by the program (e.g., htons and htonl from
the C library). On the other hand, Sparse has better coverage of program code, as it is
based on static analysis.

The Valgrind dynamic analysis framework [6] comes bundled with a set of bug
detection tools. Very popular is the Memcheck tool for detecting memory access errors
and leaks, which also served as an inspiration for the design and implementation of
Endicheck. We mention the tool here, because it actually performs a variant of dynamic
taint analysis: it marks each bit of the program memory as valid or invalid (tainted).

Closely related is also the runtime type checker Hobbes [2] for binary executables, 
which can detect some kinds of type mismatch bugs common in C programs. In order 
to reduce the number of false bug reports and to delimit integer values, Hobbes uses 
the mechanism of continuation markers — the first byte of each value has the marker 
unset, and the remaining bytes are set to indicate that they represent a continuation of 
an existing value. The analysis technique used by Hobbes could be modified to track 
endianness of integer values instead of distinguishing between pointers and integers, 
since one can model integers of different endianness as values that have different types 
(also like in the case of Sparse). 

Another approach with functionality similar to Endicheck has been implemented 
within the LLVM/Clang plugin called DataFlowSanitizer [10]. It is a dynamic analysis 
framework that (i) enables programs to define tags for data values and check for specific
tags, both through its API functions, and (ii) propagates all tags with the data. 


7 Conclusion 


We have presented a new dynamic analysis tool, Endicheck, for detecting endianness
bugs in C/C++ programs. The tool is built upon the Valgrind framework. Endicheck
provides a useful, and in many settings also preferable, alternative to static analysis tools
like Sparse, because (1) it reports quite precise results (i.e., a low number of false
warnings) due to the nature of dynamic analysis and (2) it requires fewer annotations (and
other changes) in the source code of the subject program in order to be able to detect
missing byte-swap operations. The results of our experimental evaluation show that
Endicheck can (1) handle large complex programs and (2) identify actual endianness bugs,
and it has practical performance overhead. Endicheck could also be used in automated
testing scenarios, as a useful alternative to testing programs on both little- and big-endian
processor architectures. A testing environment based on Endicheck might be easier to
set up than an environment based, for example, on virtual machines.


7.1 Future Work 


Possible extensions of Endicheck, which could improve its precision and practical use- 
fulness even further, include: 


— More complex analysis approach based on explicit tagging of each byte in an integer
data value with its position.
— Reporting arithmetic instructions that use data with target endianness.
— Automatically checking system calls such as write for correct endianness.
— Suppression files for endianness bug reports to eliminate false positives.


Another way to detect endianness bugs more precisely is to use comparative runs 
(i.e., a kind of equivalence checking). The key idea is to run a program on two machines,
where one has a big-endian architecture and the other has a little-endian architecture, 
and compare the data leaving both variants of the program. This approach has the po- 
tential to be the most accurate, because it can even detect problems in cases when data 
leaving the program are encrypted or compressed. On the other hand, it cannot always 
detect situations when the program forgets to byte-swap input data, unless the error 
affects one of the output values with concrete endianness. 
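
The idea of comparative runs can be sketched in a few lines of Python. In this illustration the two machines are merely simulated by passing the host byte order as a parameter. The serializer functions are hypothetical and do not come from any of the programs we evaluated.

```python
import struct

def serialize_buggy(value, host_order):
    """Writes a 32-bit value in host order, i.e. the byte swap is missing."""
    return struct.pack(host_order + "I", value)

def serialize_fixed(value, host_order):
    """Always writes network (big-endian) order, regardless of the host."""
    return struct.pack("!I", value)

def outputs_agree(serializer, value):
    """Compare the data leaving the program on both simulated machines."""
    little = serializer(value, "<")  # stands in for the little-endian machine
    big = serializer(value, ">")     # stands in for the big-endian machine
    return little == big

print(outputs_agree(serialize_buggy, 0x01020304))  # outputs differ: bug exposed
print(outputs_agree(serialize_fixed, 0x01020304))  # outputs are identical
print(outputs_agree(serialize_buggy, 0))           # bug hidden by this input
```

The last call shows a limitation of any comparison of outputs: a value whose byte representation is symmetric does not expose the missing swap, so the result depends on the test inputs.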
Abstract. We present a programming language for describing and analysing
concurrent quantum systems. We have an interpreter for programs in the 
language, using a symbolic rather than a numeric calculator, and we give 

its performance on examples from quantum communication and cryptog- 
raphy. 


Quantum cryptographic protocols such as BB84 QKD [3] and E92 QKD [7] 
offer unconditional statistical security. These protocols have been implemented in 
commercial products; various QKD networks have been built around the world; 
and China has launched a dedicated satellite for quantum communication. The 
security of the protocols has been established information-theoretically, but their 
implementations may have security loopholes. We intend to investigate the se- 
curity question, eventually by using formal methods to verify the properties of 
implementations, but first by simulation of protocols expressed as programs. 

Large companies are developing full-stack solutions for implementing quan- 
tum algorithms, and quantum computers will likely be network-linked. Although 
we have focused on quantum communication and cryptography protocols, as- 
pects of our work will be applicable to distributed quantum computation. 

Concurrent quantum systems, such as communication and cryptographic protocols,
assume physically-separated agents (Alice, Bob, etc.) who communicate
by sending each other qubits (quantum bits: polarised photons, for example) 
and classical bit-strings. There are a few dedicated, high-level programming lan- 
guages for quantum systems such as Microsoft's Q# [2]. They focus on single- 
machine computation and lack a treatment of communication, but a protocol 
simulation must ensure, for example, that a qubit transferred from one agent 
to another can't be used again by the sender and can't be used by the receiver 
before it is sent. We decided therefore to take a process-calculus approach, and 
we have implemented a tool inspired by CQP [9]. Our implementation is called 
qtpi [1], and uses symbolic rather than numeric quantum calculation. Programs 
are checked statically, before they run, to ensure that they obey real-world
restrictions on the use of qubits (no cloning, no sharing). Unlike CQP, which
preserves all possible outcomes, labelling each with a probability, qtpi takes a 
single execution path, making probabilistic choices between outcomes. 

We have used qtpi to simulate simple protocols such as teleportation, and 
some more involved ones including the quantum key-distribution protocols BB84 
[3] and E92 [7]. Each of these involves transmission of qubits and public trans- 
mission of classical messages (in the case of BB84, over an authenticated channel 
[13]), all of which is simulated. It is early days in our development of the tool, 
so there is as yet no provision for formal proof, but in these examples we can 
already simulate well over 1M qubit transfers per minute on a small laptop — i.e. 
we can simulate largish examples in a useful time. 


1 Processes 


Protocols are carried out by agents which send each other messages but share 
no other information. We simulate agents by processes which share no data or 
variables. Typical protocol steps from the literature are 


— obtain a qubit, perhaps initialised to one of $|0\rangle$, $|1\rangle$, $|+\rangle$ or $|-\rangle$;
— put a qubit through a gate such as I, H, X, etc.; 

— measure a qubit; 

— send or receive a qubit; 

— send or receive a classical value, such as a list of numbers or bits. 


In addition an agent may perform a calculation, such as generating 1000 ran- 
dom bits or encrypting/decrypting a message or checking the values received in a 
message. Calculations aren't protocol steps and don't affect qubit state, though 
they often depend on the results of measuring qubits and their results often 
influence subsequent protocol steps. Our processes have analogues of protocol 
steps and calculations. In addition we are able to create processes, to choose 
conditionally between different processes and to set up a collection of processes 
running simultaneously. 

The aim of our work is to mathematically analyse programs which describe
quantum systems. Towards that end we have a semantics of quantum-mechanical 
calculation [5], written in Coq [10]. That is work in progress: for the time being 
we are able to execute our protocol-programs using our simulator [1]. 


1.1 A programming language 


Our language has two distinct notations: a protocol-step language, which is de- 
rived from the pi-calculus [11], and a functional calculation language, somewhat 
in the style of Miranda [12]. Neither language has assignment, although qubit 
measurement does change program state and so needs special attention. The 
protocol-step language has recursion, but only tail recursion: i.e. nothing can fol- 
low a process invocation step (but note that parallel execution of sub-processes 
provides more complexity). 
Following the pi-calculus we use channels to communicate between processes.
So Alice doesn’t send to Bob, she sends down a channel which Bob can read from 
— or perhaps it might be Eve, if there is interference. Channels are values, so 
you can set up communication between two processes by giving them the same 
channel-argument when you create them, and you can send channel values in 
messages to alter connections dynamically. 

In the protocol-step language steps are separated by dots (‘.’) and choices
are made between processes rather than single or multiple steps. Channels are
created by (new c); send is C!E,..,E; receive is C?(x,..,x); qubits are created
by (newq q); quantum gating is Q,..,Q>>G; quantum measurement is Q-/-(x).

In the expression language there is function application (f arg), arithmetic 
and Boolean calculation, conditional choice and recursion. It uses infinite-precision 
rationals for numerical calculations. 


1.2 Symbolic quantum calculation 


Quantum calculations can be described using quantum circuits: diagrams such 
as Fig. 1 show how qubits (one per line) are put through gates (boxes, line- 
connectors) and/or measured (meter symbols) giving a classical 0/1 result. 

In quantum mechanics the state of a qubit is a vector $a|0\rangle + b|1\rangle$, with
$|a|^2 + |b|^2 = 1$. Here $|0\rangle$ and $|1\rangle$ are the computational basis vectors, a and b are
complex amplitudes, and $|a|^2$ and $|b|^2$ give the probability of measuring the state
as $|0\rangle$ or $|1\rangle$. In qtpi a single isolated qubit is therefore a pair of complex numbers,
and quantum gates, such as the H, X and Z gates of Fig. 1, are square matrices
of complex numbers which modify the state by multiplication. The state of n
entangled qubits is a $2^n$-element vector, and matrices which manipulate all of it have
to be $2^n \times 2^n$, so calculations with large entanglements can rapidly grow out
of the range of straightforward simulation. Luckily, quantum security protocols
typically work with a small number of qubits at a time.
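
The vector-and-matrix view can be made concrete with a small numeric sketch in Python. It is not qtpi code (qtpi calculates symbolically), and the helper names are our own. It applies a gate to a single-qubit state by matrix multiplication and reports how the state size grows with the number of entangled qubits.

```python
import math

h = math.sqrt(0.5)
H = [[h, h], [h, -h]]   # Hadamard gate
X = [[0, 1], [1, 0]]    # bit flip

def apply(gate, state):
    """Multiply a 2x2 gate matrix by a 2-element state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]

def probabilities(state):
    """Probability of measuring each basis vector: |amplitude|^2."""
    return [abs(a) ** 2 for a in state]

def state_size(n):
    """Number of amplitudes needed for n entangled qubits."""
    return 2 ** n

zero = [1 + 0j, 0j]             # the basis state |0>
plus = apply(H, zero)           # equal superposition of |0> and |1>
print(probabilities(plus))      # both close to 0.5
print(apply(X, zero))           # the basis state |1>
print(state_size(10))           # 1024 amplitudes, gates are 1024 x 1024
```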

Because our calculations are simple, we can afford to implement them symbolically.
We use h for $\sqrt{1/2}$; it is also equal to $\sin(\pi/4)$ and $\cos(\pi/4)$. A great
many formulae can be expressed in terms of powers of h: for example
$\cos(\pi/8) = \sqrt{(1+h)/2}$.

Symbolic calculation involves lots of symbolic simplification. That makes it
relatively slow, compared to calculation with floating-point numbers, but it is
absolutely accurate: $h^2 + h^2$, for example, is exactly 1. When measuring, we
must convert symbolic probabilities into numbers. But that is part of a statistical
calculation, so minor inaccuracy is acceptable.
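
A much-reduced sketch of exact calculation with h, written in Python, is given below. It is not the representation used inside qtpi, which we do not describe here. It only handles numbers of the hypothetical form p + q*h with rational p and q, using the single simplification rule h*h = 1/2.

```python
from fractions import Fraction

class HNum:
    """Exact number p + q*h, where h = sqrt(1/2) and therefore h*h = 1/2."""

    def __init__(self, p=0, q=0):
        self.p, self.q = Fraction(p), Fraction(q)

    def __add__(self, other):
        return HNum(self.p + other.p, self.q + other.q)

    def __mul__(self, other):
        # (p1 + q1*h)(p2 + q2*h) = p1*p2 + q1*q2/2 + (p1*q2 + q1*p2)*h
        return HNum(self.p * other.p + self.q * other.q / 2,
                    self.p * other.q + self.q * other.p)

    def __eq__(self, other):
        return (self.p, self.q) == (other.p, other.q)

    def __float__(self):
        # Conversion to a number, needed only when a probability is sampled.
        return float(self.p) + float(self.q) * 0.5 ** 0.5

    def __repr__(self):
        return "%s + %s*h" % (self.p, self.q)

h = HNum(0, 1)
one = HNum(1, 0)
total = h * h + h * h
print(total == one)   # exact equality, with no rounding involved
```

The conversion to a floating-point number is kept separate from the exact arithmetic, mirroring the point that numbers are needed only when a measurement outcome has to be chosen.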


[Circuit diagram omitted: qubit lines passing through H, X and Z gates and measurements]


Fig. 1. Quantum circuit for teleportation 
proc System () =
  (newq x=|+>, y=|0>) x,y>>CNot .
  (new c:^bit*bit) | Alice(x,c) | Bob(y,c)

proc Alice (x:qbit, c:^bit*bit) =
  (newq z)
  out!["initially Alice’s z is "] . outq!(qval z) . out!["\n"]
  z,x>>CNot . z>>H . z-/-(vz) . x-/-(vx) . c!vz,vx . _0

proc Bob(y:qbit, c:^bit*bit) =
  c?(b1,b2)
  y >> match b1,b2 . + 0b0,0b0 . I
                     + 0b0,0b1 . X
                     + 0b1,0b0 . Z
                     + 0b1,0b1 . Z*X .
  out!["finally Bob’s y is "] . outq!(qval y) . out!["\n"] . _0


Fig. 2. Teleportation of an unknown quantum state, with logging 


1.3 No cloning 


In the real quantum world there is no way of cloning a qubit — you can't start 
with a qubit in some arbitrary state and finish up with two qubits in that state. 
That, plus the fact that measurement irrevocably alters a qubit's state, is what 
provides quantum security protocols with unconditional security — though the 
uncertainty of measurement means that the guarantee is probabilistic, not abso- 
lute. A programming language which simulates quantum effects should therefore 
not allow copying of the value of a qubit variable. We use language restrictions to 
facilitate anti-cloning checks: in particular we severely restrict the use of qubits 
in data structures, in messages, and after measurement or transmission. Those 
checks are partly implemented by typechecking, partly by an efficient static sym- 
bolic execution before simulation begins. 
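
One of the restrictions, that a qubit must not be used after it has been sent, can be illustrated by a toy checker in Python. This is a hypothetical sketch only: it does not reproduce qtpi's typechecking or its static symbolic execution, and the step format is invented for the example.

```python
def check_no_use_after_send(steps):
    """Check one process, given as (action, qubit) pairs in program order.

    Returns a list of error messages, one per use of a qubit that the
    process has already sent away.
    """
    sent = set()
    errors = []
    for index, (action, qubit) in enumerate(steps):
        if qubit in sent:
            errors.append("step %d: %s uses %s after it was sent"
                          % (index, action, qubit))
        if action == "send":
            sent.add(qubit)
    return errors

ok_process = [("gate", "q"), ("send", "q")]
bad_process = [("send", "q"), ("measure", "q")]
print(check_no_use_after_send(ok_process))   # no errors
print(check_no_use_after_send(bad_process))  # one error, at step 1
```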


1.4 Other notable features 


Randomised priority queues of runnable processes and waiting communication 
offers ensure non-deterministic execution, and are used to eliminate infinite un- 
fairness. Logging steps can be pushed into subprocesses to clarify protocol de- 
scriptions, leaving a marker in the logged process to show where it should occur 
(see examples in artifact [6]). Type descriptions are almost entirely optional. 


2 Straightforward description 


Our aim is to provide a programming language in which protocol descriptions 
are transparently easy to read. For example, Fig. 2 shows teleportation [4] using 
three processes: Alice and Bob carry out the protocol, and System sets up the 
communication between them. The calculation follows the circuit in Fig. 1, but
is shared between agents obeying the anti-cloning restrictions. 

The System process creates qubits x and y (newq ..), initialised to $|+\rangle$ and
$|0\rangle$, and entangles them using a CNot gate (x,y>> ..). It creates a channel c
which carries pairs of bits (new c ..), and then splits into two subprocesses: one
becomes Alice, taking one of the qubits and the channel; the other becomes Bob,
with the other qubit and the same channel. Those processes run in parallel.

The Alice process creates a new qubit z, without specifying its state, and logs
that state (the anti-cloning restrictions make this tricky). Then it puts z and x
through a CNot gate (z,x>> ..), puts z alone through a Hadamard gate (z>>H),
and then measures first z (z-/-(vz)), then x (x-/-(vx)), giving bits vz and
vx. Finally it sends those bits to Bob on the c channel (c!...). The overall effect
is subtle, because first System's actions entangle x and y, so that measurement
of x constrains y, and then Alice entangles z, x and y, so that measurement of
z constrains both x and y.

The Bob process waits to receive Alice's message (c? ..), and calculates a 
gate (match ..) to process the results depending on one of four possibilities for 
the two bits it receives (note one of the gates is the matrix product of Z and 
X). It puts y through that gate (y>> ..) and logs the result. The output of this 
program is always 


initially Alice’s z is 2:(a2|0>+b2|1>) 
finally Bob’s y is 1:(a2|0>+b2|1>) 


where a2 and b2 are unknown symbolic amplitudes. A sample execution trace,
edited for brevity, shows the states produced by Alice's actions: qubit 0 is x, 1
is y, 2 is z; initially 0 and 1 are entangled, and the first step entangles all three.


Alice (2:(a2|0>+b2|1>),0:[0;1](h|00>+h|11>)) >> Cnot;
  result (2:[2;0;1](h*a2|000>+h*a2|011>+h*b2|101>+h*b2|110>),
          0:[2;0;1](h*a2|000>+h*a2|011>+h*b2|101>+h*b2|110>))
Alice 2:[2;0;1](h*a2|000>+h*a2|011>+h*b2|101>+h*b2|110>) >> H;
  result 2:[2;0;1]
    (h(2)*a2|000>+h(2)*b2|001>+h(2)*b2|010>+h(2)*a2|011>
     +h(2)*a2|100>-h(2)*b2|101>-h(2)*b2|110>+h(2)*a2|111>)
Alice: 2:(.. as above ..) -/- ;
  result 0 and (0:[0;1](h*a2|00>+h*b2|01>+h*b2|10>+h*a2|11>),
                1:[0;1](h*a2|00>+h*b2|01>+h*b2|10>+h*a2|11>))
Alice: 0:[0;1](h*a2|00>+h*b2|01>+h*b2|10>+h*a2|11>) -/- ;
  result 1 and 1:(b2|0>+a2|1>)
Chan 2: Alice -> Bob (0,1)
Bob 1:(b2|0>+a2|1>) >> X; result 1:(a2|0>+b2|1>)


Tracing several executions shows that Alice's measurements don’t always give the 
same results in vz, vx and qubit 1, so Bob doesn’t always use the same gate(s). 
The qubit z is never sent in a message, is destroyed by Alice's measurement, and 
its amplitudes are unknown to the program, but y always finishes up in the state 
that z began in. Without symbolic calculation we couldn't do such a simulation. 
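
The claim that y always ends in z's initial state can be checked numerically for fixed amplitudes. The following is a minimal sketch in Python, not qtpi code: it assumes a dictionary-based state vector over the qubit order (z, x, y), and the helper names (`apply_1q`, `cnot`, `project`, `teleport`) are illustrative inventions. Unlike qtpi's symbolic calculator it works with concrete complex numbers, and it forces each measurement outcome rather than sampling it.

```python
import math

H = 1 / math.sqrt(2)
HAD = [[H, H], [H, -H]]   # Hadamard gate
X = [[0, 1], [1, 0]]      # Pauli X
Z = [[1, 0], [0, -1]]     # Pauli Z


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of a state {bit-tuple: amplitude}."""
    out = {}
    for bits, amp in state.items():
        for new in (0, 1):
            coeff = gate[new][bits[q]]
            if coeff != 0:
                nb = bits[:q] + (new,) + bits[q + 1:]
                out[nb] = out.get(nb, 0) + coeff * amp
    return out


def cnot(state, control, target):
    """Flip the target bit of every basis state whose control bit is 1."""
    out = {}
    for bits, amp in state.items():
        nb = list(bits)
        if bits[control] == 1:
            nb[target] ^= 1
        out[tuple(nb)] = out.get(tuple(nb), 0) + amp
    return out


def project(state, q, value):
    """State after qubit q is observed as `value`, renormalised."""
    kept = {b: a for b, a in state.items() if b[q] == value}
    norm = math.sqrt(sum(abs(a) ** 2 for a in kept.values()))
    return {b: a / norm for b, a in kept.items()}


def teleport(a, b, vz, vx):
    """Return y's amplitudes after teleporting z = a|0>+b|1>, given outcomes vz, vx."""
    # z is unknown to the program; x and y start as the pair h|00>+h|11>.
    state = {(z, e, e): amp * H for z, amp in ((0, a), (1, b)) for e in (0, 1)}
    state = cnot(state, 0, 1)        # z,x >> CNot
    state = apply_1q(state, HAD, 0)  # z >> H
    state = project(state, 0, vz)    # measure z
    state = project(state, 1, vx)    # measure x
    if vx == 1:                      # Bob's correction: X, Z, or the product Z.X
        state = apply_1q(state, X, 2)
    if vz == 1:
        state = apply_1q(state, Z, 2)
    return state.get((vz, vx, 0), 0), state.get((vz, vx, 1), 0)
```

Calling `teleport(0.6, 0.8j, vz, vx)` for each of the four outcome pairs should give back amplitudes equal to (0.6, 0.8j) up to rounding, matching the trace above for the pair (0, 1).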
3 Performance on examples 


We can run various simulations of the quantum key-distribution protocol BB84 
[3], with Alice and Bob and various Eve processes. In order to generate a one- 
time key to encrypt an n-bit message, Alice needs to send many more bits than 
n, and our simulation allows us to experiment with various parameters of her 
calculation to see what happens. Here is a shortened display of part of the output 
of an example simulation (timing measurements made on VirtualBox Ubuntu 
18.10, on a 7-year-old MacBook Air with 8GB RAM): 


length of message? 4000; length of a hash key? 40; 
minimum number of checkbits? 500; number of sigmas? 10; 
number of trials? 100 


13718 qubits per trial; 0 interfered with; 100 succeeded 


It takes about 0.6 seconds for each trial, but overall it makes 1.3M qubit transfers 
and measurements in 60 CPU seconds. With an intercept-and-resend Eve, the 
same exchanges take 95 seconds, but Eve’s interference is detected every time. 
With a very short message and very few checkbits we can show that even such 
a naive Eve can sometimes win, as statistical analysis predicts. 
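
The statistical prediction can be illustrated with a classical model of BB84 under an intercept-and-resend attack. This is a minimal sketch in Python, not the qtpi simulation: it assumes ideal channels, uniformly random bases for all parties, and the textbook fact that a measurement in the wrong basis yields a uniformly random bit. The function names are illustrative.

```python
import random


def sifted_error_rate(n_qubits, eve_present, seed=0):
    """Fraction of sifted bits (bases agree) on which Alice and Bob differ.

    Assumes n_qubits is large enough that at least one bit survives sifting.
    """
    rng = random.Random(seed)
    kept = errors = 0
    for _ in range(n_qubits):
        bit, a_basis = rng.randint(0, 1), rng.randint(0, 1)
        sent_bit, sent_basis = bit, a_basis
        if eve_present:
            # Eve measures in a random basis and resends what she saw.
            e_basis = rng.randint(0, 1)
            e_bit = sent_bit if e_basis == sent_basis else rng.randint(0, 1)
            sent_bit, sent_basis = e_bit, e_basis
        b_basis = rng.randint(0, 1)
        b_bit = sent_bit if b_basis == sent_basis else rng.randint(0, 1)
        if b_basis == a_basis:  # sifting: keep only matching bases
            kept += 1
            errors += b_bit != bit
    return errors / kept


def undetected_probability(checkbits):
    """Chance that none of `checkbits` compared bits exposes such an Eve."""
    return 0.75 ** checkbits
```

Under these assumptions each compared bit exposes Eve with probability 1/4, so with 500 checkbits she is effectively always caught, while `undetected_probability(2)` is 0.5625, which is why a very short exchange lets her win fairly often.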

Our simulation of E92 QKD [7] uses 20 000 entangled qubit pairs per trial for 
the same-sized problem. Because the protocol calculations are more complicated 
and our calculation language is interpreted rather than compiled, simulation 
takes over 4 CPU minutes. 

Qtpi can handle larger entanglements. In about 13 seconds it’s able to set up 
and measure one ‘brick’ (ten qubits, all CZ-entangled) of the measurement-based 
quantum computing mechanism in [8] — but that’s too small to be useful, and 
larger entanglements are exponentially worse. 


4 Conclusions 


We have a quantum programming language which allows description of protocols 
with multiple agents. It has protection, built from well-understood computer 
science foundations, against cloning of qubits within a simulation. It is not yet 
able to deal efficiently with entanglements of more than a few qubits. Its symbolic 
calculator is fast enough for the protocols we have examined. 
Abstract. Session types provide a principled programming discipline for 
structured interactions. They represent a wide spectrum of type-systems 
for concurrency. Their type safety is thus extremely important. EMTST 
is a tool to aid in representing and validating theorems about session 
types in the Coq proof assistant. On paper, these proofs are often tricky 
and error prone. In proof assistants, they are typically long and difficult 
to prove. In this work, we propose a library that helps validate the theory 
of session types calculi in proof assistants. As a case study, we study two 
of the most used binary session types systems: we show the impossibility 
of representing the first system in α-equivalent representations, and we 
prove type preservation for the revisited system. We develop our tool 
in the Coq proof assistant, using locally nameless for binders and small 
scale reflection to simplify the handling of linear typing environments. 


Keywords: Concurrency · proof assistants · meta-theory · session-types. 


1 Introduction 


Given the prevalence of distributed computing and multi-core processors, 
concurrency is a key aspect of modern computing. The transition from sequential 
models of computation to concurrent systems has huge practical and theoretical 
consequences. Message passing calculi (like the π-calculus) have been used 
to model these systems since their introduction by Milner et al. [15]. Notably, 
in many cases typing disciplines are used as a way to control concurrent and 
distributed behaviour. Certifying basic typed π-calculi is important for both the 
safety of implementations and the trustworthiness of new theories. 

In this work, we concentrate on providing tools for reasoning about session 
types [10], a typing discipline for structured interactions in distributed systems. 
Session types are applied to a wide range of problems, and their properties, such 
as deadlock-freedom, are well studied. These calculi are very expressive, and 
rather complex, with features like: shared and linear communication channels, 
name passing, and fresh name generation. Given this complexity, it is not 
surprising that some innocent looking extensions violated the type safety 
properties of the calculus in several works in the literature (as pointed out by 
[23]). In consequence, the interest in mechanisation and formal proofs has risen 
significantly as a means to increase trust in such systems. 

Type systems offer certain security properties by construction. These 
guarantees are backed by rigorous proofs (these proofs form the meta-theory of 
the system). However, these proofs are cumbersome to write, maintain and extend. 
Proof assistants aim to help with these problems. In this work, we develop the 
EMTST library to aid in the implementation of session calculi type systems. 
As a form of validation, we implement and replicate results in the meta-theory 
of binary session types. Concretely, we use the Coq proof assistant [20] to study 
the representation and meta-theory of the two systems described in [23]. 

EMTST uses locally nameless (LN) [1, 5] variable binders to represent syntax. 
The tool implements an LN library with extended support for multiple binding 
scopes, together with a robust environment implementation suited to the 
challenges of session typing disciplines. The library and lemmas are written 
taking advantage of boolean reflection through the use of the Ssreflect [7] library. 

We implement two case studies from [23]. We refer to the first as the 
original system and to the second as the revised system. Notably, the way the 
original system handles names (Sect. 3.1) makes its representation impossible 
when using intrinsically α-convertible terms (e.g., locally nameless, de Bruijn 
indices, and many others). Furthermore, in Sect. 3.2 we discuss how the revised 
system allows us to implement and prove type preservation. In hindsight, this 
problem appears evident, but it is an unexpected consequence, and it shows 
that mechanising proofs brings further understanding even to well-established 
and thoroughly studied systems. EMTST and our case studies are available at 
https://github.com/emtst/emtst-proof. 

The rest of the paper is structured as follows: in the next section 
we introduce the ideas and design behind EMTST, our library for mechanising 
the meta-theory of session types. Subsequently, in Sect. 3, we present the two 
case studies: the original system from [23, 11] in Sect. 3.1 and the revisited 
system in Sect. 3.2. We close with related work and conclusions. 


2 EMTST: a Tool for Representing the Meta-theory of 
Session Types 


The study of meta-theory (i.e., proving that a system has the expected 
properties) gives us confidence in the design. Additionally, proof formalisations 
not only give us confidence in the results, but also often result in new insights 
about a problem. This is because successful mechanisations require very 
precise specifications and careful thought to define and revisit all the concepts. 
In this context, EMTST is a tool that implements locally nameless (initially 
proposed by [8, 14, 13], and more recently further developed in [1, 5]) with 
multiple binding scopes, and a robust typing environment implementation using 
boolean reflection (by building on top of Ssreflect [7]). 

The key concept of LN is to use de Bruijn indices [2] for bound variables 
and names (sometimes called “atoms” in the literature) for free variables. A 
representation of syntax is well formed, namely locally closed, when this invariant 
is respected (i.e., no de Bruijn index is free). Finally, in order to deal with open 
terms, there are two convenient operations on syntax: one to open binders in 
terms, and one to close binders. The former substitutes a bound variable with a 
fresh name, and the latter does the converse. For more details, refer to our tech 
report [4], the references, and the implementation. 
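
The open and close operations can be illustrated on a toy syntax. The sketch below is in Python rather than Coq and uses a small lambda-term datatype as a stand-in for process syntax; the class and function names (`BVar`, `FVar`, `Lam`, `App`, `open_term`, `close_term`, `locally_closed`) are assumptions made for illustration and are not EMTST definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BVar:          # bound variable: a de Bruijn index
    index: int


@dataclass(frozen=True)
class FVar:          # free variable: a name (atom)
    name: str


@dataclass(frozen=True)
class Lam:           # a binder; its body may mention index 0
    body: object


@dataclass(frozen=True)
class App:
    fun: object
    arg: object


def open_term(t, name, k=0):
    """Replace the bound index k in t by the free name `name`."""
    if isinstance(t, BVar):
        return FVar(name) if t.index == k else t
    if isinstance(t, FVar):
        return t
    if isinstance(t, Lam):
        return Lam(open_term(t.body, name, k + 1))
    return App(open_term(t.fun, name, k), open_term(t.arg, name, k))


def close_term(t, name, k=0):
    """Converse of open_term: replace the free name by the bound index k."""
    if isinstance(t, FVar):
        return BVar(k) if t.name == name else t
    if isinstance(t, BVar):
        return t
    if isinstance(t, Lam):
        return Lam(close_term(t.body, name, k + 1))
    return App(close_term(t.fun, name, k), close_term(t.arg, name, k))


def locally_closed(t, depth=0):
    """True when no de Bruijn index escapes its enclosing binders."""
    if isinstance(t, BVar):
        return t.index < depth
    if isinstance(t, FVar):
        return True
    if isinstance(t, Lam):
        return locally_closed(t.body, depth + 1)
    return locally_closed(t.fun, depth) and locally_closed(t.arg, depth)
```

Opening the body `App(BVar(0), FVar("y"))` with a name `x` that is fresh for it yields `App(FVar("x"), FVar("y"))`, and closing that term with `x` returns the original body; the body alone is not locally closed, while `Lam(body)` is.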


2.1 Environments and Multiple Name Scopes 


The locally nameless implementation is in three files. The first 
(theories/Atom.v) provides the basic definition and specification of atoms to 
act as names, the second one (theories/AtomScopes.v) provides a way to create 
multiple disjoint sets of names for representing variables in the different scopes 
that session types require (e.g. variables and channel names), and the final one 
(theories/Env.v) implements contexts and typings as finite maps, with emphasis 
on supporting the linearity requirements of various session typing disciplines. 

We use module types and parametrised modules to abstract the type of 
atoms together with their supported operations. Figure 1 shows the interface for 
working with atoms: how to compare them and functions to obtain a fresh atom 
given a finite sequence of atoms (definition: fresh), and to have proof that the 
fresh atom is actually fresh (definition: fresh_not_in). 


Module Type ATOM. 
  Parameter atom : Set. 
  Definition t := atom. 

  (* atoms can be compared to booleans *) 
  Parameter eq_atom : atom → atom → bool. 
  Parameter eq_reflect : ∀ (a b : atom), 
    ssrbool.reflect (a = b) (eq_atom a b). 
  Parameter atom_eqMixin : Equality.mixin_of atom. 
  Canonical atom_eqType := EqType atom atom_eqMixin. 

  Parameter fresh : seq atom → atom. 
  Parameter fresh_not_in : ∀ l, (fresh l) ∉ l. 

  (* ... *) 
End ATOM. 


Figure 1. The type of atoms 
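
To give a feel for the contract behind fresh and fresh_not_in, and for why separate scopes never clash, here is a minimal sketch in Python. It is an analogy, not a translation of theories/Atom.v or theories/AtomScopes.v: the `AtomScope` class, the pair encoding of atoms, and the scope tags are all assumptions made for illustration.

```python
class AtomScope:
    """A family of atoms tagged by scope; atoms of different scopes never coincide."""

    def __init__(self, tag):
        self.tag = tag

    def atom(self, n):
        # An atom is modelled as a (scope tag, number) pair.
        return (self.tag, n)

    def fresh(self, used):
        """Return an atom of this scope that does not occur in `used`."""
        taken = {n for (tag, n) in used if tag == self.tag}
        n = 0
        while n in taken:
            n += 1
        return (self.tag, n)


# Two hypothetical scopes: expression variables and linear channel variables.
expression_vars = AtomScope("EV")
linear_channels = AtomScope("LC")
used = [expression_vars.atom(0), expression_vars.atom(1), linear_channels.atom(0)]
```

With these definitions, `expression_vars.fresh(used)` is an atom outside `used`, which is the property that fresh_not_in states in the Coq interface, and picking a fresh name in one scope cannot collide with names from the other.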


Environments. Environments are parametrised over two types, one for the 
keys, and one for the values. An environment env is either undefined, or 
a finite map of unique keys and values. All the operations keep the invariant 
that any operation that would lead to a duplicated key makes the environment 
undefined. We define the expected operations and lemmas over the type env. 
We provide an extensive library of proved theorems about environments that is 
tailored to support linear and affine systems. 
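
The "undefined on duplicate key" invariant can be sketched as follows. This is a Python illustration of the idea only; the `Env` class and its method names are hypothetical and do not mirror the definitions in theories/Env.v.

```python
class Env:
    """Either undefined, or a finite map with unique keys."""

    def __init__(self, entries=None, defined=True):
        self._entries = dict(entries or {}) if defined else None

    @property
    def defined(self):
        return self._entries is not None

    def add(self, key, value):
        """Adding to an undefined env, or re-adding a key, gives undefined."""
        if not self.defined or key in self._entries:
            return Env(defined=False)
        return Env({**self._entries, key: value})

    def union(self, other):
        """Disjoint union: undefined as soon as the two domains overlap."""
        if not other.defined:
            return Env(defined=False)
        result = self
        for key, value in other._entries.items():
            result = result.add(key, value)
        return result

    def lookup(self, key):
        return self._entries.get(key) if self.defined else None
```

In a linear type system, splitting and joining typing contexts is where this pays off: joining two contexts that both claim the same channel yields an undefined environment, so the typing judgement that uses it simply fails to hold.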

EMTST is used in the two formalisations in Sect. 3.1 and 3.2, and we claim 
it is also suitable for other mechanisations where resource sensitivity and 
locally nameless are required. A release version of EMTST is available at [3] 
and the public repository at: https://github.com/emtst/emtst-proof. 


3 Two Case Studies on Binary Session Types 


EMTST is intended to help with the complex binding structure of concurrent 
calculi that have names as a first class notion together with linear or affine typing 
disciplines. We study two seminal session type systems in the literature. First, 
the original system: Honda, Vasconcelos and Kubo’s binary session type 
system [11], a milestone in the development of type systems for concurrent 
process calculi. This system types structured interaction between processes and 
supports channel mobility, that is, higher-order sessions. Second, we implement 
the revisited session type presentation from [23], inspired by [6]. Our technical 
report [4] contains an extensive presentation. 


3.1 The Original System 


Process P, Q, R ::= 

    request a(k).P         session request   | if e then P else Q   conditional 
    accept a(k).P          session accept    | P | Q                parallel 
    k![e]; P               data send         | inact                inaction 
    k?(z).P                data receive      | νn(a).P              name hiding 
    k ◁ m; P               selection         | νc(k).P              channel hiding 
    k ▷ {l : P [] r : Q}   branching         | !P                   replication 
    throw k[k’]; P         channel send 
    catch k(k’).P          channel receive 

e ::= true | false | ...   expression        m ::= l | r            labels 


Figure 2. Syntax using names 


Figure 2 presents the syntax following [23], where names are ranged over by 
a, b, c, ..., and channels are ranged over by k and k’. Notice that all the places where 
there are variable binders are denoted with parentheses followed by a dot (e.g., 
k?(z).P). The syntax is straightforwardly defined as the proc inductive type in 
theories/Syntax0.v and, following the LN technique, the locally closed predicate, 
which formalises the binding structure, is defined as the predicate lc. 

Besides its syntax, the original system is specified by its reduction, congru- 
ence and typing relations. We want to call attention to an important reduction 
rule for passing names: 


[Pass-N] throw k[k’]; P | catch k(k’).Q → P | Q 


This rule states that when passing a channel k’ the receiving end has to bind a 
channel using the same name (or be α-convertible to that name). Notably, the 
name k’ is a bound name in the receiving end, and the restriction imposed by the 
rule is a subtle change to the equality up-to α-conversion convention. Moreover, 
relaxations of that requirement may break subject reduction; a complete 
discussion is presented in Sect. 3 of [23]. As it is, this rule cannot be formalised in 
a representation that cannot distinguish between α-equivalent terms, since in 
these representations one cannot talk about the actual name of a bound variable. 
This is fundamentally what it means to be up-to α-equality. As a consequence, 
in locally nameless we are forced to specify the following rule: 


                     lc P    body Q 
[Pass-LN]  ------------------------------------------------ 
           throw k[k’]; P | catch k().Q → P | Q^{k’} 
In this version of the rule, the bound name is just an anonymous de Bruijn 
index, and when it is opened it is assigned the same name k’. This change might 
look innocent, but it breaks subject reduction. In theories/TypesO.v, we show 
that the same counter example from [23] is typable and that it breaks subject 
reduction. This is presented in the CounterExample module and in the oft_reduced 
lemma. In the next section, we discuss how this problem was addressed. 
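
Why the side condition of [Pass-N] is inexpressible in locally nameless can be seen on a toy encoding. The sketch below is a Python illustration under strong simplifications: a receiving process `catch k(b).Q` is modelled only by its channel, its bound name, and the tuple of names that Q mentions; `to_ln` and `open_with` are hypothetical helpers, not part of EMTST.

```python
def to_ln(chan, bound, body):
    """Convert a named `catch chan(bound).body` into locally nameless form.

    Occurrences of the bound name become the de Bruijn index 0; the name
    itself is forgotten.
    """
    return ("catch", chan, tuple(0 if n == bound else n for n in body))


def open_with(ln_term, name):
    """Open the binder of an LN catch term with the received name."""
    _tag, _chan, body = ln_term
    return tuple(name if n == 0 else n for n in body)


# Two α-equivalent receivers that differ only in the bound name.
same_name = to_ln("k", "k'", ("k'", "a"))
other_name = to_ln("k", "j", ("j", "a"))
```

Both receivers map to the same LN term, so no rule stated over LN terms can require "the binder is literally named k’"; opening with the received name, as [Pass-LN] does, is the only option available.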


3.2 The Revised System 


As discussed in Sect. 3.1 and [23], the presentation of the original session types 
calculus [11] makes extending it (and representing it in LN) a delicate 
operation. Fortunately, the revised system (also from [23], inspired by [6]) proposes 
a solution. Indeed, this solution is readily implementable using LN (and many 
other representations with implicit α-equivalence). 

The key insight in the design of the revisited system is considering channel 
endpoints instead of just channels. As before, a new channel is created when a 
requested session is accepted, and each continuation gets one of the endpoints of 
the newly created channel. 


Inductive proc : Set := 
| request : scvar → proc → proc 
| accept : scvar → proc → proc 
| send : channel → exp → proc → proc 
| receive : channel → proc → proc 
| select : channel → label → proc → proc 
| branch : channel → proc → proc → proc 
| throw : channel → channel → proc → proc 
| catch : channel → proc → proc 
| ife : exp → proc → proc → proc 
| par : proc → proc → proc 
| inact : proc 
| nu_ch : proc → proc (* hides a channel name *) 
| bang : proc → proc (* process replication *) 
. 

Legend: a marked proc is a process that binds a variable from A_SC, A_EV, 
A_LC or A_CN. 


Figure 3. Syntax representation annotated with binders 


For the revisited system’s formalisation we distinguish binders in four 
categories (as shown in Figure 3): first, expression variables, with names from the 
set A_EV, then shared channel variables from A_SC, also linear channel variables 
from A_LC, and finally channel names from A_CN (these names can also be bound 
in restrictions). Channel names are not variables, but objects that exist at 
run-time. 
Multiple disjoint sets of names simplify reasoning about free names 
(concretely, they avoid freshness problems among different kinds of binders). This is 
an engineering compromise, as having more binders duplicates some easy 
theorems but, in exchange, simplifies the harder theorems that rely on facts about 
LN open/close operations. Other compromises are possible. 

This concludes the technical development, and represents a full proof of 
subject reduction for binary session types, following the revised system as defined in [23]. 


4 Related Work and Conclusions 


We presented EMTST, a tool conceived to aid in the mechanisation of session 
calculi. Our tool supports locally nameless representations with many disjoint 
atom scopes, and a versatile representation of environments, all while taking 
advantage of the small scale reflection style of proofs. We validated our design 
by formalising the subject reduction proof for a full session calculus type 
system. We also explored issues with adequacy when, for example, systems contain 
fragile specifications. 

Tools like Metalib [22] (implemented based on [1]) and AutoSubst [18] exist, 
but lack the ability to represent different binding scopes in the same syntax. 
Also, Polonowski [17] implements a library for generic environments. While this 
library is similar to ours, it does not make use of boolean reflection, which, in 
our opinion, simplifies dealing with the equality of environments. While these 
libraries were influential, our requirements of multiple scopes of binding and 
boolean reflection proofs meant that we needed to develop EMTST, our own 
fit-for-purpose library. 

Finally, formalisations of session types in proof assistants exist in the 
literature (e.g., [21, 24, 19, 16, 9]), most of them with ad-hoc binder representations. 
They are not necessarily meant to be reused or general enough for other 
developments. This paper and the EMTST library are a step towards making 
such reuse easier. For that purpose we developed the library and validated its claims 
by formalising existing systems from the literature. In the process (see Sect. 3.1 
vs Sect. 3.2), we show how early mechanisation would help avoid problems 
in the presentation of a system. In the future, we plan to extend our use of the 
library to reason about multiparty session types [12] and other systems. 
Abstract. We propose a novel algorithm for the solution of mean-payoff 
games that merges together two seemingly unrelated concepts introduced 
in the context of parity games, small progress measures and quasi do- 
minions. We show that the integration of the two notions can be highly 
beneficial and significantly speeds up convergence to the problem solution. 
Experiments show that the resulting algorithm performs orders of mag- 
nitude better than the asymptotically-best solution algorithm currently 
known, without sacrificing on the worst-case complexity. 


1 Introduction 


In this article we consider the problem of solving mean-payoff games, namely 
infinite-duration perfect-information two-player games played on weighted 
directed graphs, each of whose vertices is controlled by one of the two players. The 
game starts at an arbitrary vertex and, during its evolution, each player can take 
moves at the vertices it controls, by choosing one of the outgoing edges. The 
moves selected by the two players induce an infinite sequence of vertices, called 
a play. The payoff of any prefix of a play is the sum of the weights of its edges. A 
play is winning if it satisfies the game objective, called mean-payoff objective, 
which requires that the limit of the mean payoff, taken over the prefix lengths, 
never falls below a given threshold ν. 

Mean-payoff games were first introduced and studied by Ehrenfeucht 
and Mycielski in [20], who showed that positional strategies suffice to obtain 
the optimal value. A slightly generalized version was also considered by 
Gurvich et al. in [4]. Positional determinacy entails that the decision problem for 
these games lies in NPTIME ∩ CONPTIME [34], and it was later shown to belong 
to UPTIME ∩ COUPTIME [5], UPTIME being the class of unambiguous 
non-deterministic polynomial time. This result gives the problem a rather peculiar 
complexity status, shared by very few other problems, such as integer factoriza- 
tion 2]. and parity games [5]. Despite various attempts [7]9]p]po]B4], no 
polynomial-time algorithm for the mean-payoff game problems is known so far. 

A different formulation of the game objective allows one to define another class of 
quantitative games, known as energy games. The energy objective requires that, 
given an initial value c, called credit, the sum of c and the payoff of every prefix 
of the play never falls below 0. These games, however, are tightly connected to 
mean-payoff games, as the two types of games have been proved to be log-space 
equivalent [11]. They are also related to other more complex forms of quantitative 
games. In particular, unambiguous polynomial-time reductions exist from 
these games to discounted payoff and simple stochastic games. 
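
The energy objective is easy to state operationally on a finite prefix of a play. The following Python sketch assumes the play is given simply as the list of weights met along it; the function names are illustrative.

```python
from itertools import accumulate


def prefix_payoffs(weights):
    """Payoff of every non-empty prefix: the running sums of the weights."""
    return list(accumulate(weights))


def satisfies_energy(weights, credit):
    """True if credit plus the payoff of every prefix stays at or above 0."""
    return all(credit + p >= 0 for p in prefix_payoffs(weights))


def minimal_credit(weights):
    """Smallest initial credit that keeps every prefix at or above 0."""
    return max(0, -min(prefix_payoffs(weights), default=0))
```

For the weights 2, -3, 1, -1 the prefix payoffs are 2, -1, 0, -1, so a credit of 1 suffices and a credit of 0 does not.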

Recently, a fair amount of work in formal verification has been directed to con- 
sider, besides correctness properties of computational systems, also quantitative 
specifications, in order to express performance measures and resource require- 
ments, such as quality of service, bandwidth and power consumption and, more 
generally, bounded resources. Mean-payoff and energy games also have important 
practical applications in system verification and synthesis. In the authors 
show how quantitative aspects, interpreted as penalties and rewards associated 
to the system choices, allow for expressing optimality requirements encoded as 
mean-payoff objectives for the automatic synthesis of systems that also satisfy 
parity objectives. With similar application contexts in mind, [9] and [8] further 
contribute to that effort, by providing complexity results and practical solutions 
for the verification and automatic synthesis of reactive systems from quantitative 
specifications expressed in linear time temporal logic extended with mean-payoff 
and energy objectives. Further applications to temporal networks have been 
studied in and [15]. Consequently, efficient algorithms to solve mean-payoff 
games become essential ingredients to tackle these problems in practice. 

Several algorithms have been devised in the past for the solution of the decision 
problem for mean-payoff games, which asks whether there exists a strategy for one 
of the players that grants the mean-payoff objective. The very first deterministic 
algorithm was proposed in [34], where it is shown that the problem can be solved 
with $O(n^3 \cdot m \cdot W)$ arithmetic operations, with n and m the number of positions 
and moves, respectively, and W the maximal absolute weight in the game. A 
strategy improvement approach, based on iteratively adjusting a randomly chosen 
initial strategy for one player until a winning strategy is obtained, is presented 
in [31], which has an exponential upper bound. The algorithm by Lifshits and 
Pavlov [29], which runs in time $O(n \cdot m \cdot 2^n \cdot \log_2 W)$, computes the “potential” 
of each game position, which corresponds to the initial credit that the player 
needs in order to win the game from that position. Algorithms based on the 
solution of linear feasibility problems over the tropical semiring have been also 
provided in Bl. The best known deterministic algorithm to date, which requires 
$O(n \cdot m \cdot W)$ arithmetic operations, was proposed by Brim et al. [13]. They adapt 
to energy and mean-payoff games the notion of progress measures , as applied 
to parity games in [6]. The approach was further developed in to obtain the 
same complexity bound for the optimal strategy synthesis problem. A strategy- 
improvement refinement of this technique has been introduced in [12]. Finally, 
Bjork et al. [6] proposed a randomized strategy-improvement based algorithm 
running in time $\min\{O(n^2 \cdot m \cdot W),\ 2^{O(\sqrt{n \cdot \log n})}\}$. 

Our contribution is a novel mean-payoff progress measure approach that 
enriches such measures with the notion of quasi dominions, originally introduced 
in for parity games. These are sets of positions with the property that as 
long as the opponent chooses to play to remain in the set, it loses the game for 
sure, hence its best choice is always to try to escape. A quasi dominion from 
where escaping is not possible is a winning set for the other player. Progress 
measure approaches, such as the one of [13], typically focus on finding the best 
choices of the opponent and little information is gathered on the other player. 
In this sense, they are intrinsically asymmetric. Enriching the approach with 
quasi dominions can be viewed as a way to also encode the best choices of the 
player, information that can be exploited to speed up convergence significantly. 
The main difficulty here is that suitable lift operators in the new setting do 
not enjoy monotonicity. Such a property makes proving completeness of classic 
progress measure approaches almost straightforward, as monotonic operators do 
admit a least fixpoint. Instead, the lift operator we propose is only inflationary 
(specifically, non-decreasing) and, while still admitting fixpoints [10, 33], need 
not have a least one. Hence, providing a complete solution algorithm proves 
more challenging. The advantages, however, are significant. On the one hand, 
the new algorithm still enjoys the same worst-case complexity as the best known 
algorithm for the problem proposed in [13]. On the other hand, we show that 
there exist families of games on which the classic approach requires a number of 
operations that can be made arbitrarily larger than the one required by the new 
approach. Experimental results also witness the fact that this phenomenon is by 
no means isolated, as the new algorithm performs orders of magnitude better 
than the algorithm developed in [13]. 


2 Mean-Payoff Games 


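A two-player turn-based arena is a tuple $\mathcal{A} = \langle Ps^{\oplus}, Ps^{\boxminus}, Mv \rangle$, with $Ps^{\oplus} \cap Ps^{\boxminus} = \emptyset$ 
and $Ps \triangleq Ps^{\oplus} \cup Ps^{\boxminus}$, such that $\langle Ps, Mv \rangle$ is a finite directed graph without sinks. 
$Ps^{\oplus}$ (resp., $Ps^{\boxminus}$) is the set of positions of player ⊕ (resp., ⊟) and $Mv \subseteq Ps \times Ps$ 
is a left-total relation describing all possible moves. A path in $V \subseteq Ps$ is a finite 
or infinite sequence $\pi \in Pth(V)$ of positions in V compatible with the move 
relation, i.e., $(\pi_i, \pi_{i+1}) \in Mv$, for all $i \in [0, |\pi| - 1)$. A positional strategy for 
player $\alpha \in \{\oplus, \boxminus\}$ on $V \subseteq Ps$ is a function $\sigma_\alpha \in Str_\alpha(V) \subseteq (V \cap Ps^{\alpha}) \to Ps$, 
mapping each α-position v in the domain of $\sigma_\alpha$ to a position $\sigma_\alpha(v)$ compatible 
with the move relation, i.e., $(v, \sigma_\alpha(v)) \in Mv$. With $Str_\alpha(V)$ we denote the set 
of all α-strategies on V, while $Str_\alpha$ denotes $\bigcup_{V \subseteq Ps} Str_\alpha(V)$. A play in $V \subseteq Ps$ 
from a position $v \in V$ w.r.t. a pair of strategies $(\sigma_\oplus, \sigma_\boxminus) \in Str_\oplus(V) \times Str_\boxminus(V)$, 
called $((\sigma_\oplus, \sigma_\boxminus), v)$-play, is a path $\pi \in Pth(V)$ such that $\pi_0 = v$ and, for all 
$i \in [0, |\pi| - 1)$, if $\pi_i \in Ps^{\oplus}$ then $\pi_{i+1} = \sigma_\oplus(\pi_i)$ else $\pi_{i+1} = \sigma_\boxminus(\pi_i)$. The play 
function $play : (Str_\oplus(V) \times Str_\boxminus(V)) \times V \to Pth(V)$ returns, for each position $v \in V$ 
and pair of strategies $(\sigma_\oplus, \sigma_\boxminus) \in Str_\oplus(V) \times Str_\boxminus(V)$, the maximal $((\sigma_\oplus, \sigma_\boxminus), v)$-play 
$play((\sigma_\oplus, \sigma_\boxminus), v)$. If a pair $(\sigma_\oplus, \sigma_\boxminus) \in Str_\oplus(V) \times Str_\boxminus(V)$ induces a finite 
play starting from position $v \in V$, then $play((\sigma_\oplus, \sigma_\boxminus), v)$ identifies the maximal 
prefix of that play that is contained in V. 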

A mean-payoff game (MPG for short) is a tuple $\Game = \langle \mathcal{A}, Wg, wg \rangle$, where $\mathcal{A}$ 
is an arena, $Wg \subseteq \mathbb{Z}$ is a finite set of integer weights, and $wg : Ps \to Wg$ is a 
weight function assigning a weight to each position. $Ps^{+}$ (resp., $Ps^{-}$) denotes 
the set of positive-weight positions (resp., non-positive-weight positions). For 
convenience, we shall refer to non-positive weights as negative weights. Notice 
that this definition of MPG is equivalent to the classic formulation in which the 
weights label the moves, instead. The weight function naturally extends to paths, 
by setting $wg(\pi) \triangleq \sum_{i=0}^{|\pi|-1} wg(\pi_i)$. The goal of player ⊕ (resp., ⊟) is to maximize 
(resp., minimize) $v(\pi) = \liminf_{i \to \infty} \frac{1}{i} \cdot wg(\pi_{\leq i})$, where $\pi_{\leq i}$ is the prefix up to 
index i. Given a threshold $\nu$, a set of positions $V \subseteq Ps$ is a ⊕-dominion, if there 
exists a ⊕-strategy $\sigma_\oplus \in Str_\oplus(V)$ such that, for all ⊟-strategies $\sigma_\boxminus \in Str_\boxminus(V)$ 
and positions $v \in V$, the induced play $\pi = play((\sigma_\oplus, \sigma_\boxminus), v)$ satisfies $v(\pi) > \nu$. 
The pair of winning regions $(Wn_\oplus, Wn_\boxminus)$ forms a $\nu$-mean partition. Assuming 
$\nu$ integer, the $\nu$-mean partition problem is equivalent to the 0-mean partition 
one, as we can subtract $\nu$ from the weights of all the positions. As a consequence, 
the MPG decision problem can be equivalently restated as deciding whether 
player ⊕ (resp., ⊟) has a strategy to enforce $\liminf_{i \to \infty} \frac{1}{i} \cdot wg(\pi_{\leq i}) > 0$ (resp., 
$\liminf_{i \to \infty} \frac{1}{i} \cdot wg(\pi_{\leq i}) \leq 0$), for all the resulting plays $\pi$. 
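
Once both players fix positional strategies, the play from a position is ultimately periodic, and the limit inferior of its prefix averages is the mean weight of the cycle it ends in. The Python sketch below illustrates this on a three-position game; the dictionary encoding (`owner`, `weight`, and the two strategy maps) and the example game are assumptions made for illustration.

```python
from fractions import Fraction


def play(owner, sigma_plus, sigma_minus, start):
    """Follow both positional strategies from `start`.

    Returns (stem, cycle): the positions visited before the play starts
    repeating, and the cycle it then loops on forever.
    """
    first_seen, path, v = {}, [], start
    while v not in first_seen:
        first_seen[v] = len(path)
        path.append(v)
        v = sigma_plus[v] if owner[v] == "+" else sigma_minus[v]
    return path[:first_seen[v]], path[first_seen[v]:]


def mean_payoff(weight, cycle):
    """Value of an ultimately periodic play: the average weight on its cycle."""
    return Fraction(sum(weight[p] for p in cycle), len(cycle))


# Example: a and c belong to the maximising player, b to the minimising one.
owner = {"a": "+", "b": "-", "c": "+"}
weight = {"a": 1, "b": -2, "c": 1}
sigma_plus = {"a": "b", "c": "c"}
```

If the minimiser answers at b by moving back to a, the play loops on a, b with value -1/2 and the minimiser wins with threshold 0; if it moves to c instead, the play ends in the self-loop on c with value 1.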


3 Solving Mean-Payoff Games via Progress Measures 


The abstract notion of progress measure has been introduced as a way to 
encode global properties on paths of a graph by means of simpler local properties 
of adjacent vertices. In the context of MPGs, the graph property of interest, 
called mean-payoff property, requires that the mean payoff of every infinite path in 
the graph be non-positive. More precisely, in game theoretic terms, a mean-payoff 
progress measure witnesses the existence of a strategy $\sigma_\boxminus$ for player ⊟ such that 
each path in the graph induced by fixing that strategy on the arena satisfies the 
desired property. A mean-payoff progress measure associates with each vertex 
of the underlying graph a value, called measure, taken from the set of extended 
natural numbers $\mathbb{N}_\infty \triangleq \mathbb{N} \cup \{\infty\}$, endowed with an ordering relation $\leq$ and an 
addition operation +, which extend the standard ordering and addition over the 
naturals in the usual way. Measures are associated with positions in the game 
and the measure of a position v can intuitively be interpreted as an estimate 
of the payoff that player ⊕ can enforce on the plays starting in v. In this sense, 
they measure “how far” v is from satisfying the mean-payoff property, with the 
maximal measure $\infty$ denoting failure of the property for v. More precisely, the 
⊟-strategy induced by a progress measure ensures that measures do not increase 
along the paths of the induced graph. This ensures that every path eventually 
gets trapped in a non-positive-weight cycle, witnessing a win for player ⊟. 
To obtain a progress measure, one starts from some suitable association of 
the positions of the game with measures. The local information encoded by these 
measures is then propagated back along the edges of the underlying graph so 
as to associate with each position the information gathered along plays of some 
finite length starting from that position. The propagation process is performed 
according to the following intuition. The measures of positions adjacent to v 
are propagated back to v only if those measures push v further away from the 
property. This propagation is achieved by means of a measure stretch operation 
+, which adds, when appropriate, the measure of an adjacent position to the 
weight of a given position. This is established by comparing the measure of v 
with those of its adjacent positions, since, for each position v, the mean-payoff 
property is defined in terms of the sum of the weights encountered along the plays 
from that position. The process ends when no position can be pushed further 
away from the property and each position is not dominated by any, respectively 
one, of its adjacents, depending on whether that position belongs to player ⊕ 
or to player ⊟, respectively. The positions that did not reach measure $\infty$ are 
those from which player ⊟ can win the game and the set of measures currently 
associated with such positions forms a mean-payoff progress measure. 

To make the above intuitions precise, we introduce the notion of measure
function, progress measure, and an algorithm for computing progress measures
correctly. It is worth noticing that the progress-measure based approach as
described in [13], called SEPM from now on, can be easily recast equivalently
in the form below. A measure function $\mu : \mathrm{Ps} \to \mathbb{N}_\infty$ maps each position $v$ in
the game to a suitable measure $\mu(v)$. The order $\leq$ of the measures naturally
induces a pointwise partial order $\sqsubseteq$ on the measure functions defined in the
usual way, namely, for any two measure functions $\mu_1$ and $\mu_2$, we write $\mu_1 \sqsubseteq \mu_2$ if
$\mu_1(v) \leq \mu_2(v)$, for all positions $v$. The set of measure functions over a measure
space, together with the induced ordering $\sqsubseteq$, forms a measure-function space.


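Definition 1 (Measure-Function Space). The measure-function space is the
partial order $\mathcal{F} \triangleq \langle \mathrm{MF}, \sqsubseteq \rangle$ whose components are defined as follows:


1. $\mathrm{MF} \triangleq \mathrm{Ps} \to \mathbb{N}_\infty$ is the set of all functions $\mu \in \mathrm{MF}$, called measure functions,
mapping each position $v \in \mathrm{Ps}$ to a measure $\mu(v) \in \mathbb{N}_\infty$;
2. for all $\mu_1, \mu_2 \in \mathrm{MF}$, it holds that $\mu_1 \sqsubseteq \mu_2$ if $\mu_1(v) \leq \mu_2(v)$, for all $v \in \mathrm{Ps}$.


The $\oplus$-denotation (resp., $\boxminus$-denotation) of a measure function $\mu \in \mathrm{MF}$ is the set
$\|\mu\|_\oplus \triangleq \mu^{-1}(\infty)$ (resp., $\|\mu\|_\boxminus \triangleq \overline{\mu^{-1}(\infty)}$) of all positions having maximal (resp.,
non-maximal) measure associated within $\mu$.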


Consider a position $v$ with an adjacent $u$ with measure $\eta$. A measure update
of $\eta$ w.r.t. $v$ is obtained by the stretch operator $+ : \mathbb{N}_\infty \times \mathrm{Ps} \to \mathbb{N}_\infty$, defined as
$\eta + v \triangleq \max\{0, \eta + \mathrm{wg}(v)\}$, which corresponds to the payoff estimate that the
given position will obtain by choosing to follow the move leading to $u$.
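
As a small illustration, the stretch operator can be written as a one-line function. This is only a sketch: the names `stretch` and `INF` are hypothetical, and the maximal measure is assumed to be represented by floating-point infinity, which absorbs any finite weight.

```python
import math

INF = math.inf  # assumed encoding of the maximal measure (infinity)


def stretch(eta, weight):
    """Measure stretch eta + v = max{0, eta + wg(v)}; `weight` plays the role of wg(v)."""
    return max(0, eta + weight)


# A negative weight can never push the estimate below zero,
# and the maximal measure stays maximal.
clipped = stretch(3, -7)      # 0
raised = stretch(3, 4)        # 7
saturated = stretch(INF, -7)  # inf
```

The clipping at 0 reflects the reading of a measure as an estimate of a payoff that can be enforced: a negative partial sum carries no such guarantee.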

A mean-payoff progress measure is such that the measure associated with
each game position $v$ need not be increased further in order to beat the actual
payoff of the plays starting from $v$. In particular, it can be defined by taking into
account the opposite attitude of the two players in the game. While the player $\oplus$
tries to push toward higher measures, the player $\boxminus$ will try to keep the measures
as low as possible. A measure function in which the payoff of each $\oplus$-position
(resp., $\boxminus$-position) $v$ is not dominated by the payoff of all (resp., some of) its
adjacents augmented with the weight of $v$ itself meets the requirements.


Definition 2 (Progress Measure). A measure function $\mu \in \mathrm{MF}$ is a progress
measure if the following two conditions hold true, for all positions $v \in \mathrm{Ps}$:
1. $\mu(u) + v \leq \mu(v)$, for all adjacents $u \in \mathit{Mv}(v)$ of $v$, if $v \in \mathrm{Ps}_\oplus$;
2. $\mu(u) + v \leq \mu(v)$, for some adjacent $u \in \mathit{Mv}(v)$ of $v$, if $v \in \mathrm{Ps}_\boxminus$.
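
The two conditions translate directly into a check over all positions. The sketch below assumes a hypothetical dictionary-based encoding of a game (`owner` maps a position to `"+"` for the maximizing player or `"-"` for its opponent, `weight` gives the position weights, `moves` the adjacents); none of these names come from the paper.

```python
def is_progress_measure(mu, owner, weight, moves):
    """Check the two conditions of the progress-measure definition position by position."""
    for v, adjacents in moves.items():
        # One Boolean per adjacent u: is the stretch mu(u) + v at most mu(v)?
        dominated = [max(0, mu[u] + weight[v]) <= mu[v] for u in adjacents]
        holds = all(dominated) if owner[v] == "+" else any(dominated)
        if not holds:
            return False
    return True


# Hypothetical two-position game forming a cycle of total weight 0.
owner = {"x": "+", "y": "-"}
weight = {"x": 1, "y": -1}
moves = {"x": ["y"], "y": ["x"]}

accepted = is_progress_measure({"x": 1, "y": 0}, owner, weight, moves)
rejected = is_progress_measure({"x": 0, "y": 0}, owner, weight, moves)
```

In this toy game the all-zero function is rejected because position `x` collects weight 1 before moving on, while assigning measure 1 to `x` satisfies both conditions.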


The following theorem states the fundamental property of progress measures,
namely, that every position with a non-maximal measure is won by player $\boxminus$.


Theorem 1 (Progress Measure). $\|\mu\|_\boxminus \subseteq \mathrm{Wn}_\boxminus$, for all progress measures $\mu$.


In order to obtain a progress measure from a given measure function, one can
iteratively adjust the current measure values in such a way to force the progress
condition above among adjacent positions. To this end, we define the lift operator
$\mathrm{lift} : \mathrm{MF} \to \mathrm{MF}$ as follows:

$$
\mathrm{lift}(\mu)(v) \triangleq
\begin{cases}
\max\{\mu(w) + v : w \in \mathit{Mv}(v)\}, & \text{if } v \in \mathrm{Ps}_\oplus; \\
\min\{\mu(w) + v : w \in \mathit{Mv}(v)\}, & \text{otherwise.}
\end{cases}
$$

Note that the lift operator is clearly monotone and, therefore, admits a least
fixpoint. A mean-payoff progress measure can be obtained by repeatedly applying
this operator until a fixpoint is reached, starting from the minimal measure
function $\mu_0 \triangleq \{v \in \mathrm{Ps} \mapsto 0\}$ that assigns measure 0 to all the positions in
the game. The following solver operator applied to $\mu_0$ computes the desired
solution: $\mathrm{sol} \triangleq \mathrm{lfp}\, \mu \,.\, \mathrm{lift}(\mu) : \mathrm{MF} \to \mathrm{MF}$. Observe that the measures generated by
the procedure outlined above have a fairly natural interpretation. Each positive
measure, indeed, under-approximates the weight that player $\oplus$ can enforce along
finite prefixes of the plays from the corresponding positions. This follows from the
fact that, while player $\oplus$ maximizes its measures along the outgoing moves, player
$\boxminus$ minimizes them. In this sense, each positive measure witnesses the existence
of a positively-weighted finite prefix of a play that player $\oplus$ can enforce. Let
$S \triangleq \sum \{\mathrm{wg}(v) \in \mathbb{N} : v \in \mathrm{Ps} \wedge \mathrm{wg}(v) > 0\}$ be the sum of all the positive weights
in the game. Clearly, the maximal payoff of a simple play in the underlying
graph cannot exceed $S$. Therefore, a measure greater than $S$ witnesses the
existence of a cycle whose payoff diverges to infinity and is won, thus, by player
$\oplus$. Hence, any measure strictly greater than $S$ can be substituted with the
value $\infty$. This observation establishes the termination of the algorithm and is
instrumental to its completeness proof. Indeed, at the fixpoint, the measures
actually coincide with the highest payoff player $\oplus$ is able to guarantee. Soundness
and completeness of the above procedure have been established in [13], where the
authors also show that, despite the algorithm requiring $O(n \cdot S) = O(n^2 \cdot W)$
lift operations in the worst-case, with $n$ the number of positions and $W$ the
maximal positive weight in the game, the overall cost of these lift operations is
$O(S \cdot m \cdot \log S) = O(n \cdot m \cdot W \cdot \log(n \cdot W))$, with $m$ the number of moves and
$O(\log S)$ the cost of arithmetic operations to compute the stretch of the measures.
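
The fixpoint computation just described can be sketched as a plain iteration of the lift operator from the all-zero measure function, with every value above $S$ replaced by infinity. This is a naive illustration and not the worklist-based procedure of [13]; the dictionary encoding and the names `solve_by_lifting`, `owner`, `weight` and `moves` are assumptions made for the example.

```python
import math

INF = math.inf  # assumed encoding of the maximal measure


def solve_by_lifting(owner, weight, moves):
    """Apply lift to all positions at once until nothing changes; return (measure, rounds)."""
    bound = sum(w for w in weight.values() if w > 0)  # S, the sum of positive weights
    mu = {v: 0 for v in moves}                        # minimal measure function
    rounds = 0
    while True:
        lifted = {}
        for v, adjacents in moves.items():
            stretched = [max(0, mu[u] + weight[v]) for u in adjacents]
            value = max(stretched) if owner[v] == "+" else min(stretched)
            lifted[v] = INF if value > bound else value  # measures above S become infinity
        if lifted == mu:
            return mu, rounds
        mu, rounds = lifted, rounds + 1


# Hypothetical one-position games: a self-loop of weight +1 and one of weight -1.
diverging, _ = solve_by_lifting({"p": "+"}, {"p": 1}, {"p": ["p"]})
stable, _ = solve_by_lifting({"p": "+"}, {"p": -1}, {"p": ["p"]})
```

On the positive self-loop the measure exceeds $S = 1$ after two rounds and is set to infinity, while on the negative one the all-zero function is already a fixpoint.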


4 Solving Mean-Payoff Games via Quasi Dominions 


Let us consider the simple example game depicted in Figure 1, where the shape
of each position indicates the owner, circles for player $\oplus$ and squares for its
opponent $\boxminus$, and, in each label of the form $\ell/w$, the letter $w$ corresponds to the
associated weight, where we assume $k > 1$. Starting from the smallest measure
function $\mu_0 = \{a, b, c, d \mapsto 0\}$, the first application of the lift operator returns
$\mu_1 = \{a \mapsto k; b, c \mapsto 0; d \mapsto 1\} = \mathrm{lift}(\mu_0)$. After that step, the following iterations
of the fixpoint alternately update positions $c$ and $d$, since the other ones already
satisfy the progress condition. Being $c \in \mathrm{Ps}_\boxminus$, the lift operator chooses for it the
measure computed along the move $(c, d)$, thus obtaining $\mu_2(c) = \mathrm{lift}(\mu_1)(c) =
\mu_1(d) = 1$. Subsequently, $d$ is updated to $\mu_3(d) = \mathrm{lift}(\mu_2)(d) = \mu_2(c) + 1 = 2$. A
progress measure is obtained after exactly $2k + 1$ iterations, when the measure of $c$
reaches value $k$ and $d$ value $k + 1$. Note, however, that the choice of the move $(c, d)$
is clearly a losing strategy for player $\boxminus$, as remaining in the highlighted region
would make the payoff from position $c$ diverge. Therefore, the only reasonable
choice for player $\boxminus$ is to exit from that region by taking the move leading
to position $a$. An operator able to diagnose this phenomenon early on could
immediately discard the move $(c, d)$ and jump directly to the correct payoff
obtained by choosing the move to position $a$. As we shall see, such an operator
might lose the monotonicity property and recovering the completeness of the
resulting approach will prove more involved.
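
The $2k + 1$ count can be replayed mechanically. The sketch below encodes the example game from the moves and weights mentioned in the text; the figure itself is not reproduced here, so the owner of position b and its self-loop are assumptions, as are the function name `lift_rounds` and the `"+"`/`"-"` owner labels.

```python
def lift_rounds(k):
    """Count the applications of lift that change the measure on the example game."""
    # Owners: "+" maximizes, "-" minimizes. Position b is ASSUMED to be a "-" sink
    # with a self-loop; the text only says that its measure stays 0.
    owner = {"a": "-", "b": "-", "c": "-", "d": "+"}
    weight = {"a": k, "b": 0, "c": 0, "d": 1}
    moves = {"a": ["a", "b", "d"], "b": ["b"], "c": ["a", "d"], "d": ["c"]}
    mu = {v: 0 for v in moves}
    rounds = 0
    while True:
        lifted = {}
        for v, adjacents in moves.items():
            stretched = [max(0, mu[u] + weight[v]) for u in adjacents]
            lifted[v] = max(stretched) if owner[v] == "+" else min(stretched)
        if lifted == mu:
            return rounds, mu
        mu, rounds = lifted, rounds + 1


rounds, final_measure = lift_rounds(5)
```

Under these assumptions the loop runs 11 changing rounds for `k = 5`, that is $2k + 1$, and ends with measure $k$ on a and c, 0 on b and $k + 1$ on d. No cut-off at $S$ is needed here because no measure ever exceeds $S = k + 1$.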


In the rest of this article we devise a progress operator that does precisely 
that. We start by providing a notion of quasi dominion, originally introduced for 
parity games in El, which can be exploited in the context of MPGs. 


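Definition 3 (Quasi Dominion). A set of positions $Q \subseteq \mathrm{Ps}$ is a quasi $\oplus$-dominion
if there exists a $\oplus$-strategy $\sigma_\oplus \in \mathrm{Str}_\oplus(Q)$, called $\oplus$-witness for $Q$,
such that, for all $\boxminus$-strategies $\sigma_\boxminus \in \mathrm{Str}_\boxminus(Q)$ and positions $v \in Q$, the play $\pi =
\mathrm{play}((\sigma_\oplus, \sigma_\boxminus), v)$, called $(\sigma_\oplus, v)$-play in $Q$, satisfies $\mathrm{wg}(\pi) > 0$. If the condition
$\mathrm{wg}(\pi) > 0$ holds only for infinite plays $\pi$, then $Q$ is called weak quasi $\oplus$-dominion.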


Fig. 1: An MPG.

Essentially, a quasi $\oplus$-dominion consists in a set $Q$ of positions
starting from which player $\oplus$ can force plays in $Q$ of
positive weight. Analogously, any infinite play that player $\oplus$ can
force in a weak quasi $\oplus$-dominion has positive weight. Clearly,
any quasi $\oplus$-dominion is also a weak quasi $\oplus$-dominion. Moreover,
the latter are closed under subsets, while the former are
not. It is an immediate consequence of the definition above that
all infinite plays induced by the $\oplus$-witness, if any, necessarily have infinite weight
and, thus, are winning for player $\oplus$. Indeed, every such play $\pi$ is regular, i.e. it
can be decomposed into a prefix $\pi'$ and a simple cycle $(\pi'')^\omega$, i.e. $\pi = \pi' (\pi'')^\omega$,
since the strategies we are considering are memoryless. Now, $\mathrm{wg}((\pi'')^\omega) > 0$, so,
$\mathrm{wg}(\pi'') > 0$, which implies $\mathrm{wg}((\pi'')^\omega) = \infty$. Hence, $\mathrm{wg}(\pi) = \infty$.


Proposition 1. Let $Q$ be a weak quasi $\oplus$-dominion with $\sigma_\oplus \in \mathrm{Str}_\oplus(Q)$ one of its
$\oplus$-witnesses and $Q^\star \subseteq Q$. Then, for all $\boxminus$-strategies $\sigma_\boxminus \in \mathrm{Str}_\boxminus(Q^\star)$ and positions
$v \in Q^\star$ the following holds: if the $(\sigma_\oplus{\restriction}Q^\star, v)$-play $\pi = \mathrm{play}((\sigma_\oplus{\restriction}Q^\star, \sigma_\boxminus), v)$ is
infinite, then $\mathrm{wg}(\pi) = \infty$.

From Proposition 1 it directly follows that, if a weak quasi $\oplus$-dominion $Q$ is
closed w.r.t. its $\oplus$-witness, namely all the induced plays are infinite, then it is a
$\oplus$-dominion, hence is contained in $\mathrm{Wn}_\oplus$.


Consider again the example of Figure 1. The set of positions $Q \triangleq \{a, c, d\}$ forms
a quasi $\oplus$-dominion whose $\oplus$-witness is the only possible $\oplus$-strategy mapping
position $d$ to $c$. Indeed, any infinite play remaining in $Q$ forever and compatible
with that strategy (e.g., the play from position $c$ when player $\boxminus$ chooses the move
from $c$ leading to $d$ or the one from $a$ to itself or the one from $a$ to $d$) grants an
infinite payoff. Any finite compatible play, instead, ends in position $a$ (e.g., the
play from $c$ when player $\boxminus$ chooses the move from $c$ to $a$ and then one from $a$
to $b$) giving a payoff of at least $k > 0$. On the other hand, $Q^\star \triangleq \{c, d\}$ is only a
weak quasi $\oplus$-dominion, as player $\boxminus$ can force a play of weight 0 from position
$c$, by choosing the exiting move $(c, a)$. However, the internal move $(c, d)$ would
lead to an infinite play in $Q^\star$ of infinite weight.


The crucial observation here is that the best choice for player $\boxminus$ in any
position of a (weak) quasi $\oplus$-dominion is to exit from it as soon as it can, while
the best choice for player $\oplus$ is to remain inside it as long as possible. The idea of
the algorithm we propose in this section is to precisely exploit the information
provided by the quasi dominions in the following way. Consider the example
above. In position $a$ player $\boxminus$ must choose to exit from $Q = \{a, c, d\}$, by taking
the move $(a, b)$, without changing its measure, which corresponds to its
weight $k$. On the other hand, the best choice for player $\boxminus$ in position $c$ is to
exit from the weak quasi-dominion $Q^\star = \{c, d\}$, by choosing the move $(c, a)$
and lifting its measure from 0 to $k$. Note that this contrasts with the minimal
measure-increase policy for player $\boxminus$ employed in [13], which would keep choosing
to leave $c$ in the quasi-dominion by following the move to $d$, which gives the
minimal increase in measure of value 1. Once $c$ is out of the quasi-dominion,
though, the only possible move for player $\oplus$ in position $d$ is to follow $c$, taking
measure $k + 1$. The resulting measure function is the desired progress measure.


In order to make this intuitive idea precise, we need to be able to identify
quasi dominions first. Interestingly enough, the measure functions $\mu$ defined in the
previous section do allow us to identify a quasi dominion, namely the set of positions
$\overline{\mu^{-1}(0)}$ having positive measure. Indeed, as observed at the end of that section, a
positive measure witnesses the existence of a positively-weighted finite play that
player $\oplus$ can enforce from that position onward, which is precisely the requirement
of Definition 3. In the example of Figure 1, $\overline{\mu_0^{-1}(0)} = \emptyset$ and $\overline{\mu_2^{-1}(0)} = \{a, c, d\}$ are
both quasi dominions, the first one w.r.t. the empty $\oplus$-witness and the second
one w.r.t. the $\oplus$-witness $\sigma_\oplus(d) = c$.


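We shall keep the quasi-dominion information in pairs $(\mu, \sigma)$, called quasi-dominion
representations (QDR, for short), composed of a measure function $\mu$ and
a $\oplus$-strategy $\sigma$, which corresponds to one of the $\oplus$-witnesses of the set of positions
with positive measure in $\mu$. The connection between these two components is
formalized in the definition below that also provides the partial order over which
the new algorithm operates.

Definition 4 (QDR Space). The quasi-dominion-representation space is the
partial order $\mathcal{Q} \triangleq \langle \mathrm{QDR}, \sqsubseteq \rangle$, whose components are defined as follows:


1. $\mathrm{QDR} \subseteq \mathrm{MF} \times \mathrm{Str}_\oplus$ is the set of all pairs $\varrho \triangleq (\mu_\varrho, \sigma_\varrho) \in \mathrm{QDR}$, called
quasi-dominion-representations, composed of a measure function $\mu_\varrho \in \mathrm{MF}$ and a
$\oplus$-strategy $\sigma_\varrho \in \mathrm{Str}_\oplus(Q(\varrho))$, where $Q(\varrho) \triangleq \overline{\mu_\varrho^{-1}(0)}$, for which:

(a) $Q(\varrho)$ is a quasi $\oplus$-dominion enjoying $\sigma_\varrho$ as a $\oplus$-witness;

(b) $\|\varrho\|_\oplus$ is a $\oplus$-dominion;

(c) $\mu_\varrho(v) \leq \mu_\varrho(\sigma_\varrho(v)) + v$, for all $\oplus$-positions $v \in Q(\varrho) \cap \mathrm{Ps}_\oplus$;

(d) $\mu_\varrho(v) \leq \mu_\varrho(u) + v$, for all $\boxminus$-positions $v \in Q(\varrho) \cap \mathrm{Ps}_\boxminus$ and $u \in \mathit{Mv}(v)$;

2. for all $\varrho_1, \varrho_2 \in \mathrm{QDR}$, it holds that $\varrho_1 \sqsubseteq \varrho_2$ if $\mu_{\varrho_1} \sqsubseteq \mu_{\varrho_2}$ and $\sigma_{\varrho_1}(v) = \sigma_{\varrho_2}(v)$,
for all $\oplus$-positions $v \in Q(\varrho_1) \cap \mathrm{Ps}_\oplus$ with $\mu_{\varrho_1}(v) = \mu_{\varrho_2}(v)$.


The $\alpha$-denotation $\|\varrho\|_\alpha$ of a QDR $\varrho$, with $\alpha \in \{\oplus, \boxminus\}$, is the $\alpha$-denotation $\|\mu_\varrho\|_\alpha$
of its measure function.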


Condition 1a is obvious. Condition 1b, instead, requires that every position
with infinite measure is indeed won by player $\oplus$ and is crucial to guarantee
the completeness of the algorithm. Finally, Conditions 1c and 1d ensure that
every positive measure under-approximates the actual weight of some finite play
within the induced quasi dominion. This is formally captured by the following
proposition, which can be easily proved by induction on the length of the play.


Proposition 2. Let $\varrho$ be a QDR and $v\pi u$ a finite path starting at position $v \in \mathrm{Ps}$
and terminating in position $u \in \mathrm{Ps}$ compatible with the $\oplus$-strategy $\sigma_\varrho$. Then,

$$\mu_\varrho(v) \leq \mathrm{wg}(v\pi) + \mu_\varrho(u).$$


It is immediate to see that every MPG admits a non-trivial QDR space,
since the pair $(\mu_0, \sigma_0)$, with $\mu_0$ the smallest measure function and $\sigma_0$ the empty
strategy, trivially satisfies all the required conditions.


Proposition 3. Every MPG has a non-empty QDR space associated with it.


The solution procedure we propose, called QDPM from now on, can intuitively
be broken down as an alternation of two phases. The first one tries to lift the
measures of positions outside the quasi dominion $Q(\varrho)$ in order to extend it,
while the second one lifts the positions inside $Q(\varrho)$ that can be forced to exit
from it by player $\boxminus$. The algorithm terminates when no new position can be
absorbed within the quasi dominion and no measure needs to be lifted to allow
the $\boxminus$-winning positions to exit from it, when possible. To this end, we define
a controlled lift operator $\mathrm{lift} : \mathrm{QDR} \times 2^{\mathrm{Ps}} \times 2^{\mathrm{Ps}} \to \mathrm{QDR}$ that works on QDRs and
takes two additional parameters, a source and a target set of positions. The
intended meaning is that we want to restrict the application of the lift operation
to the positions in the source set $S$, while using only the moves leading to the
target set $T$. The different nature of the two types of lifting operations is reflected
in the actual values of the source and target parameters.


$\mathrm{lift}(\varrho, S, T) \triangleq \varrho^\star$, where

$$
\mu_{\varrho^\star}(v) \triangleq
\begin{cases}
\max\{\mu_\varrho(u) + v : u \in \mathit{Mv}(v) \cap T\}, & \text{if } v \in S \cap \mathrm{Ps}_\oplus; \\
\min\{\mu_\varrho(u) + v : u \in \mathit{Mv}(v) \cap T\}, & \text{if } v \in S \cap \mathrm{Ps}_\boxminus; \\
\mu_\varrho(v), & \text{otherwise;}
\end{cases}
$$

and, for all $\oplus$-positions $v \in Q(\varrho^\star) \cap \mathrm{Ps}_\oplus$, we choose $\sigma_{\varrho^\star}(v) \in \mathrm{argmax}_{u \in \mathit{Mv}(v) \cap T}\, \mu_\varrho(u) + v$,
if $\mu_{\varrho^\star}(v) \neq \mu_\varrho(v)$, and $\sigma_{\varrho^\star}(v) = \sigma_\varrho(v)$, otherwise. Except for the
restriction on the outgoing moves considered, which are those leading to the
targets in $T$, the lift operator acts on the measure component of a QDR very
much like the original lift operator does. In order to ensure that the result is still
a QDR, however, the lift operator must also update the $\oplus$-witness of the quasi
dominion. This is required to guarantee that Conditions 1a and 1c of Definition 4
are preserved. If the measure of a $\oplus$-position $v$ is not affected by the lift, the
$\oplus$-witness must not change for that position. However, if the application of the
lift operation increases the measure, then the $\oplus$-witness on $v$ needs to be updated
to any move $(v, u)$ that grants measure $\mu_{\varrho^\star}(v)$ to $v$. In principle, more than one
such move may exist and any one of them can serve as witness.

The solution corresponds to the inflationary fixpoint [10][3] of the two
phases mentioned above, $\mathrm{sol} \triangleq \mathrm{ifp}\, \varrho \,.\, \mathrm{prg}_+(\mathrm{prg}_0(\varrho)) : \mathrm{QDR} \to \mathrm{QDR}$, defined by
the progress operators $\mathrm{prg}_0$ and $\mathrm{prg}_+$. The first phase is computed by the operator
$\mathrm{prg}_0 : \mathrm{QDR} \to \mathrm{QDR}$ as follows: $\mathrm{prg}_0(\varrho) \triangleq \sup\{\varrho, \mathrm{lift}(\varrho, \overline{Q(\varrho)}, \mathrm{Ps})\}$. This operator
is responsible for enforcing the progress condition on the positions outside the
quasi dominion $Q(\varrho)$ that do not satisfy the inequalities between the measures
along a move leading to $Q(\varrho)$ itself. It does that by applying the lift operator with
$\overline{Q(\varrho)}$ as source and no restrictions on the moves. Those positions that acquire a
positive measure in this phase contribute to enlarging the current quasi dominion.
Observe that the strategy component of the QDR is updated so that it is a
$\oplus$-witness of the new quasi dominion. To guarantee that measures never decrease,
the supremum w.r.t. the QDR-space ordering is taken as result.


Lemma 1. $\mu_\varrho$ is a progress measure over $\overline{Q(\varrho)}$, for all fixpoints $\varrho$ of $\mathrm{prg}_0$.


The second phase, instead, implements the mechanism intuitively described
above, while analyzing the simple example of Figure 1. This is achieved by the
operator $\mathrm{prg}_+$ reported in Algorithm 1. The procedure iteratively examines the
current quasi dominion and lifts the measures of the positions that must exit
from it. Specifically, it processes $Q(\varrho)$ layer by layer, starting from the outer layer
of positions that must escape. The process ends when a, possibly empty, closed
weak quasi dominion is obtained. Recall that all the positions in a closed weak
quasi dominion are necessarily winning for player $\oplus$, due to Proposition 1. We
distinguish two sets of positions in $Q(\varrho)$: those that already satisfy the progress
condition and those that do not. The measures of the first ones already witness an
escape route from $Q(\varrho)$. The other ones, instead, are those whose current choice
is to remain inside it. For instance, when considering the measure function $\mu_2$ in
the example of Figure 1, position $a$ belongs to the first set, while positions $c$ and
$d$ to the second one, since the choice of $c$ is to follow the internal move $(c, d)$.

Since the only positions that change measure are those in the second set, only
such positions need to be examined. To identify them, which form a weak quasi
dominion $\Delta(\varrho)$ strictly contained in $Q(\varrho)$, we proceed as follows. First, we collect
the set $\mathrm{npp}(\varrho)$ of positions in $Q(\varrho)$ that do not satisfy the progress condition,
called the non-progress positions. Then, we compute the set of positions that
will have no choice other than reaching $\mathrm{npp}(\varrho)$, by computing the inflationary
fixpoint of a suitable pre operator.


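$$
\begin{aligned}
\mathrm{npp}(\varrho) \triangleq{} & \{v \in Q(\varrho) \cap \mathrm{Ps}_\oplus : \exists u \in \mathit{Mv}(v) \,.\, \mu_\varrho(v) < \mu_\varrho(u) + v\} \\
{} \cup{} & \{v \in Q(\varrho) \cap \mathrm{Ps}_\boxminus : \forall u \in \mathit{Mv}(v) \,.\, \mu_\varrho(v) < \mu_\varrho(u) + v\}; \\
\mathrm{pre}(\varrho, Q) \triangleq{} & Q \cup \{v \in Q(\varrho) \cap \mathrm{Ps}_\oplus : \sigma_\varrho(v) \in Q\} \\
{} \cup{} & \{v \in Q(\varrho) \cap \mathrm{Ps}_\boxminus : \forall u \in \mathit{Mv}(v) \setminus Q \,.\, \mu_\varrho(v) < \mu_\varrho(u) + v\}.
\end{aligned}
$$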


The final result is $\Delta(\varrho) \triangleq (\mathrm{ifp}\, Q \,.\, \mathrm{pre}(\varrho, Q))(\mathrm{npp}(\varrho))$. Intuitively, $\Delta(\varrho)$ contains all
the $\oplus$-positions that are forced to reach $\mathrm{npp}(\varrho)$ via the quasi-dominion $\oplus$-witness
and all the $\boxminus$-positions that can only avoid reaching $\mathrm{npp}(\varrho)$ by strictly increasing
their measure, which player $\boxminus$ obviously wants to prevent.
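
To see the two operators at work, the sketch below computes the non-progress positions and the resulting set on the example of Figure 1, taking the measure function reached after two lift rounds and `k = 5`. The encoding is hypothetical (dictionary-based, `"+"` for the maximizing player and `"-"` for its opponent), the strict inequalities follow the reading of the progress condition given in the text, and position b is again assumed to be a sink with a self-loop.

```python
def stretch(mu, weight, u, v):
    """Stretch mu(u) + v of the measure of adjacent u w.r.t. position v."""
    return max(0, mu[u] + weight[v])


def npp(mu, owner, weight, moves, Q):
    """Positions of Q violating the progress condition."""
    result = set()
    for v in Q:
        above = [mu[v] < stretch(mu, weight, u, v) for u in moves[v]]
        if any(above) if owner[v] == "+" else all(above):
            result.add(v)
    return result


def delta(mu, sigma, owner, weight, moves):
    """Close npp under pre: positions of Q with no choice but to reach npp."""
    Q = {v for v in moves if mu[v] > 0}
    current = npp(mu, owner, weight, moves, Q)
    while True:
        extended = set(current)
        for v in Q:
            if owner[v] == "+":
                if sigma[v] in current:
                    extended.add(v)
            elif all(mu[v] < stretch(mu, weight, u, v)
                     for u in moves[v] if u not in current):
                extended.add(v)
        if extended == current:
            return current
        current = extended


k = 5
owner = {"a": "-", "b": "-", "c": "-", "d": "+"}   # owner of b is an assumption
weight = {"a": k, "b": 0, "c": 0, "d": 1}
moves = {"a": ["a", "b", "d"], "b": ["b"], "c": ["a", "d"], "d": ["c"]}
mu = {"a": k, "b": 0, "c": 1, "d": 1}
sigma = {"d": "c"}

non_progress = npp(mu, owner, weight, moves, {"a", "c", "d"})
region = delta(mu, sigma, owner, weight, moves)
```

Here only d violates the progress condition, and c joins it because its sole alternative, the move to a, would strictly increase its measure. Position a stays out, since the move to b already justifies its measure.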

It is important to observe that, from a functional view-point, the progress
operator $\mathrm{prg}_+$ would work just as well if applied to the entire quasi dominion $Q(\varrho)$,
since it would simply leave unchanged the measure of those positions that already
satisfy the progress condition. However, it is crucial that only the positions in
$\Delta(\varrho)$ are processed in order to achieve the best asymptotic complexity bound
known to date. We shall return to this point later on.

Alg. 1: Progress Operator

    signature $\mathrm{prg}_+ : \mathrm{QDR} \to \mathrm{QDR}$
    function $\mathrm{prg}_+(\varrho)$
    1  $Q \leftarrow \Delta(\varrho)$
    2  while $\mathrm{esc}(\varrho, Q) \neq \emptyset$ do
    3      $E \leftarrow \mathrm{bep}(\varrho, Q)$
    4      $\varrho \leftarrow \mathrm{lift}(\varrho, E, \overline{Q})$
    5      $Q \leftarrow Q \setminus E$
    6  $\varrho \leftarrow \mathrm{win}(\varrho, Q)$
    7  return $\varrho$

At each iteration of the while-loop of Algorithm 1, let $Q$ denote the current
(weak) quasi dominion, initially set to $\Delta(\varrho)$ (Line 1). It first identifies the
positions in $Q$ that can immediately escape from it (Line 2). Those are (i) all the
$\boxminus$-positions with a move leading outside of $Q$ and (ii) the $\oplus$-positions $v$ whose
$\oplus$-witness $\sigma_\varrho$ forces $v$ to exit from $Q$, namely $\sigma_\varrho(v) \notin Q$, and that cannot
strictly increase their measure by choosing to remain in $Q$. While the condition
for $\boxminus$-positions is obvious, the one for $\oplus$-positions requires some explanation. The
crucial observation here is that, while player $\oplus$ does indeed prefer to remain in
the quasi dominion, it can only do so while ensuring that by changing strategy it
does not enable infinite plays within $Q$ that are winning for the adversary. In
other words, the new $\oplus$-strategy must still be a $\oplus$-witness for $Q$ and this can
only be ensured if the new choice strictly increases its measure. The operator
$\mathrm{esc} : \mathrm{QDR} \times 2^{\mathrm{Ps}} \to 2^{\mathrm{Ps}}$ formalizes the idea:


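$$
\begin{aligned}
\mathrm{esc}(\varrho, Q) \triangleq{} & \{v \in Q \cap \mathrm{Ps}_\boxminus : \mathit{Mv}(v) \setminus Q \neq \emptyset\} \\
{} \cup{} & \{v \in Q \cap \mathrm{Ps}_\oplus : \sigma_\varrho(v) \notin Q \wedge \forall u \in \mathit{Mv}(v) \cap Q \,.\, \mu_\varrho(u) + v \leq \mu_\varrho(v)\}.
\end{aligned}
$$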


Consider, for instance, the example in Figure 2 and a QDR $\varrho$ such that
$\mu_\varrho = \{a \mapsto 3; b \mapsto 2; c, d, f \mapsto 1; e \mapsto 0\}$ and $\sigma_\varrho = \{b \mapsto a; f \mapsto d\}$. In this case,
we have $Q(\varrho) = \{a, b, c, d, f\}$ and $\Delta(\varrho) = \{c, d, f\}$, since $c$ is the only non-progress
position, $d$ is forced to follow $c$ in order to avoid the measure increase required to
reach $b$, and $f$ is forced by the $\oplus$-witness to reach $d$. Now, consider the situation
where the current weak quasi dominion is $Q = \{c, f\}$, i.e. after $d$ has escaped
from $\Delta(\varrho)$. The escape set of $Q$ is $\{c, f\}$. To see why the $\oplus$-position $f$ is escaping,
observe that $\mu_\varrho(f) + f = 1 = \mu_\varrho(f)$ and that, indeed, should player $\oplus$ choose to
change its strategy and take the move $(f, f)$ to remain in $Q$, it would obtain an
infinite play with payoff 0, thus violating the definition of weak quasi dominion.
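
The escape set of this example can be recomputed with a few lines. The sketch encodes only the fragment of the Figure 2 game that the text mentions: the weights of c, d and f are inferred from the equalities quoted in this section, the owners from the roles the positions play, and any move not named in the text is left out. The function name `esc` mirrors the operator, while the dictionary encoding is an assumption.

```python
def esc(mu, sigma, owner, weight, moves, Q):
    """Escape set of Q: positions that can, or must, leave Q right away."""
    escaping = set()
    for v in Q:
        inside = [u for u in moves[v] if u in Q]
        outside = [u for u in moves[v] if u not in Q]
        if owner[v] == "-":
            if outside:  # the minimizing player has some move leaving Q
                escaping.add(v)
        elif sigma[v] not in Q and all(
            max(0, mu[u] + weight[v]) <= mu[v] for u in inside
        ):
            # the witness leaves Q and no internal move strictly raises the measure
            escaping.add(v)
    return escaping


# Fragment of the example: only what the text states or implies.
mu = {"a": 3, "b": 2, "c": 1, "d": 1, "f": 1, "e": 0}
sigma = {"b": "a", "f": "d"}
owner = {"c": "-", "d": "-", "f": "+"}
weight = {"c": 1, "d": 0, "f": 0}
moves = {"c": ["a", "f"], "d": ["b", "c"], "f": ["d", "f"]}

after_d_left = esc(mu, sigma, owner, weight, moves, {"c", "f"})
initially = esc(mu, sigma, owner, weight, moves, {"c", "d", "f"})
```

With `Q = {c, f}` both positions escape, as stated above. With the full set `{c, d, f}` only c and d do, because the witness of f still points inside the set.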

Before proceeding, we want to stress an easy consequence
of the definition of the notion of escape set and
Conditions 1c and 1d of Definition 4, i.e., that every escape
position of the quasi dominion $Q(\varrho)$ can only assume its
weight as possible measure inside a QDR $\varrho$, as reported
in the following proposition. This observation, together
with Proposition 2, ensures that the measure of a position
$v \in Q(\varrho)$ is an under-approximation of the weight of all
finite plays leaving $Q(\varrho)$.


Fig. 2: Another MPG.

Proposition 4. Let $\varrho$ be a QDR. Then, $\mu_\varrho(v) = \mathrm{wg}(v) > 0$, for all $v \in \mathrm{esc}(\varrho, Q(\varrho))$.


Now, going back to the analysis of the algorithm, if the escape set is non-empty,
we need to select the escape positions that need to be lifted in order to satisfy the
progress condition. The main difficulty is to do so in such a way that the resulting
measure function still satisfies Condition 1d of Definition 4, for all the $\boxminus$-positions
with positive measure. The problem occurs when a $\boxminus$-position can exit either
immediately or passing through a path leading to another position in the escape
set. Consider again the example above, where $Q = \Delta(\varrho) = \{c, d, f\}$. If position $d$
immediately escapes from $Q$ using the move $(d, b)$, it would change its measure
to $\mu'(d) = \mu(b) + d = 2 > \mu(d) = 1$. Now, position $c$ has two ways to escape,
either directly with move $(c, a)$ or by reaching the other escape position $d$ passing
through $f$. The first choice would set its measure to $\mu(a) + c = 4$. The resulting
measure function, however, would not satisfy Condition 1d of Definition 4, as the
new measure of $c$ would be greater than $\mu'(d) + c = 2$, which prevents obtaining a QDR.
Similarly, if position $d$ escapes from $Q$ passing through $c$ via the move $(c, a)$, we
would have $\mu''(d) = \mu''(c) + d = (\mu(a) + c) + d = 4 > 2 = \mu(b) + d$, still violating
Condition 1d. Therefore, in this specific case, the only possible way to escape is to
reach $b$. The solution to this problem is simply to lift in the current iteration only
those positions that obtain the lowest possible measure increase, hence position $d$
in the example, leaving the lift of $c$ to some subsequent iteration of the algorithm
that would choose the correct escape route via $d$. To do so, we first compute the
minimal measure increase, called the best-escape forfeit, that each position in the
escape set would obtain by exiting the quasi dominion immediately. The positions
with the lowest possible forfeit, called best-escape positions, can all be lifted at
the same time. The intuition is that the measure of all the positions that escape
from a (weak) quasi dominion will necessarily be increased by at least the minimal
best-escape forfeit. This observation is at the core of the proof of the theorem
(see the appendix) ensuring that the desired properties of QDRs are preserved by
the operator $\mathrm{prg}_+$. The set of best-escape positions is computed by the operator
$\mathrm{bep} : \mathrm{QDR} \times 2^{\mathrm{Ps}} \to 2^{\mathrm{Ps}}$ as follows: $\mathrm{bep}(\varrho, Q) \triangleq \mathrm{argmin}_{v \in \mathrm{esc}(\varrho, Q)}\, \mathrm{bef}(\mu_\varrho, Q, v)$, where
the operator $\mathrm{bef} : \mathrm{MF} \times 2^{\mathrm{Ps}} \times \mathrm{Ps} \to \mathbb{N}_\infty$ computes, for each position $v$ in a quasi
dominion $Q$, its best-escape forfeit:


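$$
\mathrm{bef}(\mu, Q, v) \triangleq
\begin{cases}
\max\{\mu(u) + v - \mu(v) : u \in \mathit{Mv}(v) \setminus Q\}, & \text{if } v \in \mathrm{Ps}_\oplus; \\
\min\{\mu(u) + v - \mu(v) : u \in \mathit{Mv}(v) \setminus Q\}, & \text{otherwise.}
\end{cases}
$$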


In our example, $\mathrm{bef}(\mu, Q, c) = \mu(a) + c - \mu(c) = 4 - 1 = 3$, while $\mathrm{bef}(\mu, Q, d) =
\mu(b) + d - \mu(d) = 2 - 1 = 1$. Therefore, $\mathrm{bep}(\varrho, Q) = \{d\}$.
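
The same two forfeits can be recomputed programmatically. As before, the sketch relies on a hypothetical dictionary encoding and on the fragment of the Figure 2 game recoverable from the text (weights 1 and 0 for c and d, both taken to belong to the minimizing player `"-"`); the escape set `{c, d}` is passed in explicitly rather than derived.

```python
def bef(mu, owner, weight, moves, Q, v):
    """Best-escape forfeit of v: measure increase when leaving Q immediately."""
    forfeits = [max(0, mu[u] + weight[v]) - mu[v] for u in moves[v] if u not in Q]
    return max(forfeits) if owner[v] == "+" else min(forfeits)


def bep(mu, owner, weight, moves, Q, escape_set):
    """Escape positions with the lowest best-escape forfeit."""
    forfeit = {v: bef(mu, owner, weight, moves, Q, v) for v in escape_set}
    lowest = min(forfeit.values())
    return {v for v, value in forfeit.items() if value == lowest}


# Fragment of the example: only what the text states or implies.
mu = {"a": 3, "b": 2, "c": 1, "d": 1, "f": 1}
owner = {"c": "-", "d": "-", "f": "+"}
weight = {"c": 1, "d": 0, "f": 0}
moves = {"c": ["a", "f"], "d": ["b", "c"], "f": ["d", "f"]}
Q = {"c", "d", "f"}

forfeit_c = bef(mu, owner, weight, moves, Q, "c")
forfeit_d = bef(mu, owner, weight, moves, Q, "d")
best = bep(mu, owner, weight, moves, Q, {"c", "d"})
```

The values 3 and 1 agree with the computation above, so only d is lifted in this iteration and c waits for a later one.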

Once the set E of best-escape positions is identified (Line 3), the procedure 
lifts them restricting the possible moves to those leading outside the current quasi 
dominion (Line 4). Those positions are, then, removed from the set (Line 5), thus 
obtaining a smaller weak quasi dominion ready for the next iteration. 

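The algorithm terminates when the (possibly empty) current quasi dominion
$Q$ is closed. By virtue of Proposition 1, all those positions belong to $\mathrm{Wn}_\oplus$ and
their measure is set to $\infty$ by means of the operator $\mathrm{win} : \mathrm{QDR} \times 2^{\mathrm{Ps}} \to \mathrm{QDR}$
(Line 6), which also computes the winning $\oplus$-strategy on those positions, as follows:
$\mathrm{win}(\varrho, Q) \triangleq \varrho^\star$, where $\mu_{\varrho^\star} \triangleq \mu_\varrho[Q \mapsto \infty]$ and, for all $\oplus$-positions $v \in Q(\varrho^\star) \cap \mathrm{Ps}_\oplus$,
we choose $\sigma_{\varrho^\star}(v) \in \mathrm{argmax}_{u \in \mathit{Mv}(v) \cap Q}\, \mu_\varrho(u) + v$, if $\sigma_\varrho(v) \notin Q$, and $\sigma_{\varrho^\star}(v) = \sigma_\varrho(v)$,
otherwise. Observe that, since we know that every $\oplus$-position $v \in Q \cap \mathrm{Ps}_\oplus$, whose
current $\oplus$-witness leads outside $Q$, is not an escape position, any move $(v, u)$
within $Q$ that grants the maximal stretch $\mu_\varrho(u) + v$ strictly increases its measure
and, therefore, is a possible choice for a $\oplus$-witness of the $\oplus$-dominion $Q$.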

At this point, it should be quite evident that the progress operator $\mathrm{prg}_+$ is
responsible for enforcing the progress condition on the positions inside the quasi
dominion $Q(\varrho)$, thus, the following necessarily holds.


Lemma 2. $\mu_\varrho$ is a progress measure over $Q(\varrho)$, for all fixpoints $\varrho$ of $\mathrm{prg}_+$.


In order to prove the correctness of the proposed algorithm, we first need to
ensure that any quasi-dominion space $\mathcal{Q}$ is indeed closed under the operators
$\mathrm{prg}_0$ and $\mathrm{prg}_+$. This is established by the following theorem, which states that
the operators are total functions on that space.


Theorem 2. The operators $\mathrm{prg}_0$ and $\mathrm{prg}_+$ are total inflationary functions.


Since both operators are inflationary, so is their composition, which admits a
fixpoint. Therefore, the operator sol is well defined. Moreover, following the same
considerations discussed at the end of Section 3, it can be proved that the fixpoint is
obtained after at most $n \cdot (S + 1)$ iterations. Let $\mathrm{ifp}_k\, X \,.\, F(X)$ denote the $k$-th
iteration of an inflationary operator $F$. Then, we have the following theorem.


Theorem 3 (Termination). The solver operator $\mathrm{sol} \triangleq \mathrm{ifp}\, \varrho \,.\, \mathrm{prg}_+(\mathrm{prg}_0(\varrho))$ is
a well-defined total function. Moreover, for every $\varrho \in \mathrm{QDR}$ it holds that $\mathrm{sol}(\varrho) =
(\mathrm{ifp}_k\, \varrho^\star \,.\, \mathrm{prg}_+(\mathrm{prg}_0(\varrho^\star)))(\varrho)$, for some index $k \leq n \cdot (S + 1)$, where $n$ is the number
of positions in the MPG and $S \triangleq \sum \{\mathrm{wg}(v) \in \mathbb{N} : v \in \mathrm{Ps} \wedge \mathrm{wg}(v) > 0\}$ the total
sum of its positive weights.

As already observed before, Figure 1 exemplifies an infinite family of games
with a fixed number of positions and increasing maximal weight $k$ over which the
SEPM algorithm requires $2k + 1$ iterations of the lift operator. On the contrary,
QDPM needs exactly two iterations of the solver operator sol to find the progress
measure, starting from the smallest measure function $\mu_0$. Indeed, the first iteration
returns a measure function $\mu_1 = \mathrm{sol}(\mu_0)$, with $\mu_1(a) = k$, $\mu_1(b) = \mu_1(c) = 0$,
and $\mu_1(d) = 1$, while the second one $\mu_2 = \mathrm{sol}(\mu_1)$ identifies the smallest progress
measure, with $\mu_2(a) = \mu_2(c) = k$, $\mu_2(b) = 0$, and $\mu_2(d) = k + 1$. From these
observations, the next result immediately follows.

Theorem 4. An infinite family of MPGs $\{G_k\}_k$ exists on which QDPM requires
a constant number of measure updates, while SEPM requires $O(k)$ such updates.

From Theorem 1 and Lemmas 1 and 2 it follows that the solution provided
by the algorithm is indeed a progress measure, hence establishing soundness.
Completeness follows from Theorem 3 and from Condition 1b of the definition of
QDR space, which ensures that all the positions with infinite measure are winning
for player $\oplus$.


Theorem 5 (Correctness). $\|\mathrm{sol}(\varrho)\|_\boxminus = \mathrm{Wn}_\boxminus$, for every $\varrho \in \mathrm{QDR}$.


The following lemma ensures that each execution of the operator $\mathrm{prg}_+$ strictly
increases the measure of all the positions in $\Delta(\varrho)$.


Lemma 3. Let $\varrho^\star \triangleq \mathrm{prg}_+(\varrho)$. Then, $\mu_{\varrho^\star}(v) > \mu_\varrho(v)$, for all positions $v \in \Delta(\varrho)$.


Recall that each position can at most be lifted $S + 1 = O(n \cdot W)$ times and,
by the previous lemma, the complexity of sol only depends on the cumulative
cost of such lift operations. We can express, then, the total cost as the sum, over
the set of positions in the game, of the cost of all the lift operations performed
on those positions. Each such operation can be computed in time linear in the
number of incoming and outgoing moves of the corresponding lifted position $v$,
namely $O((|\mathit{Mv}(v)| + |\mathit{Mv}^{-1}(v)|) \cdot \log S)$, with $O(\log S)$ the cost of each arithmetic
operation involved. Summing all up, the actual asymptotic complexity of
the procedure can, therefore, be expressed as $O(n \cdot m \cdot W \cdot \log(n \cdot W))$.

Theorem 6 (Complexity). QDPM requires time $O(n \cdot m \cdot W \cdot \log(n \cdot W))$ to
solve an MPG with $n$ positions, $m$ moves, and maximal positive weight $W$.
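
The summation behind this bound can be spelled out as follows. The calculation is only a sketch of the argument given in the text: it uses the stated limit of $S + 1$ lifts per position and the stated per-lift cost, and adds two elementary facts, namely that every move is counted once as outgoing and once as incoming, and that $S \leq n \cdot W$.

```latex
\begin{align*}
\sum_{v \in \mathrm{Ps}} (S + 1) \cdot \bigl(|\mathit{Mv}(v)| + |\mathit{Mv}^{-1}(v)|\bigr) \cdot \log S
  &= (S + 1) \cdot \log S \cdot \sum_{v \in \mathrm{Ps}} \bigl(|\mathit{Mv}(v)| + |\mathit{Mv}^{-1}(v)|\bigr) \\
  &= (S + 1) \cdot 2m \cdot \log S \\
  &= O(S \cdot m \cdot \log S) \\
  &= O(n \cdot m \cdot W \cdot \log(n \cdot W)).
\end{align*}
```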


5 Experimental Evaluation 


In order to assess the effectiveness of the proposed approach, we implemented 
both QDPM and SEPM [13], the most efficient known solution to the problem 
and the more closely related one to QDPM, in C++ within OINK [2]. OINK 
has been developed as a framework to compare parity game solvers. However, 
extending the framework to deal with MPGs is not difficult. The arenas of the
two types of games essentially coincide in form, the only relevant difference
being that MPGs allow negative numbers to label game positions. We ran the
two solvers against randomly generated MPGs of various sizes. 


1 The experiments were carried out on a 64-bit 3.9GHz quad-core machine, with an Intel
i5-6600K processor and 8GB of RAM, running Ubuntu 18.04.

Figure 3 compares the solution time, expressed in seconds, of the two algorithms
on 4000 games, each with $10^4$ positions and randomly assigned weights in the
range $[-15 \times 10^3, 15 \times 10^3]$. The scale of both axes is logarithmic. The
experiments are divided into 4 clusters, each containing 1000 games. The benchmarks
in different clusters differ in the maximal number m of outgoing moves per position,
with $m \in \{10, 20, 40, 80\}$. These experiments clearly show that QDPM
substantially outperforms SEPM. Most often, the gap between the two algorithms
is between two and three orders of magnitude, as indicated by the dashed diagonal
lines. It also shows that SEPM is particularly sensitive to the density of the
underlying graph, as its performance degrades significantly as the number of moves
increases. The maximal solution time was 21000 sec. for SEPM and 0.017 sec. for
QDPM. Figure 4, instead, compares the two algorithms fixing the maximal out-degree
of the underlying graphs to 2, in the left-hand picture, and to 40, in the right-hand
one, while increasing the number of positions from $10^3$ to $10^5$ along the x-axis.
Each picture displays the performance results on 2800 games. Each point shows the
total time to solve 100 randomly generated games with that given number of positions,
which increases by 1000 up to size $2 \cdot 10^4$ and by 10000 thereafter. In both
pictures the scale is logarithmic. For the experiments in the right-hand picture we
had to set a timeout for SEPM of 45 minutes per game, which was hit most of the time
on the bigger ones. Once again, QDPM significantly outperforms SEPM on both kinds
of benchmarks, with a gap of more than an order of magnitude on the first ones, and
a gap of more than three orders of magnitude on the second ones. The results also
confirm that the performance gap grows considerably as the number of moves per
position increases.

Fig. 3: Random games with $10^4$ positions.
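
The random benchmarks described above can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the paper does not describe its generator, so the function name `random_mpg`, the uniform sampling of weights and successors, and the random assignment of owners are all hypothetical choices.

```python
import random

def random_mpg(n, max_out, max_weight, seed=None):
    """Build a random mean-payoff game arena (hypothetical generator).

    Returns (weights, owners, moves): weights[v] is the integer weight that
    labels position v, owners[v] is 0 or 1, and moves[v] is the sorted list
    of successors of v. Every position gets at least one outgoing move.
    """
    rng = random.Random(seed)
    weights = [rng.randint(-max_weight, max_weight) for _ in range(n)]
    owners = [rng.randint(0, 1) for _ in range(n)]
    moves = []
    for _ in range(n):
        out_degree = rng.randint(1, min(max_out, n))
        moves.append(sorted(rng.sample(range(n), out_degree)))
    return weights, owners, moves

# Example: a small game in the style of the first experiment
# (the paper uses 10^4 positions and weights in [-15000, 15000]).
weights, owners, moves = random_mpg(n=100, max_out=10, max_weight=15000, seed=1)
```

Passing a fixed `seed` makes a benchmark set reproducible, which matters when two solvers are compared on the same games.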


Fig. 4: Total solution times in seconds of SEPM and QDPM on 5600 random games.

We are not aware of actual concrete benchmarks for MPGs. However, exploiting
the standard encoding of parity games into mean-payoff games [25], we can
compare the behavior of SEPM and QDPM on concrete verification problems
encoded as parity games. For completeness, Table 1 reports some experiments
on such problems. The table reports the execution times, expressed in seconds,
required by the two algorithms to solve instances of two classic verification
problems: the Elevator Verification and the Language Inclusion problems. These
two benchmarks are included in the PGSolver toolkit and are often used
as benchmarks for parity game solvers. The first benchmark is a verification
under fairness constraints of a simple model of an elevator, while the second
one encodes the language inclusion problem between a non-deterministic Büchi
automaton and a deterministic one. The results on various instances of those
problems confirm that QDPM significantly outperforms the classic progress
measure approach. Note also that the translation into MPGs, which encodes
priorities as weights whose absolute value is exponential in the values of the
priorities, leads to games with weights of high magnitude. Hence, the results
in Table 1 provide further evidence that QDPM is far less dependent on the
absolute value of the weights. They also show that QDPM can be very effective
for the solution of real-world qualitative verification problems.
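
To make the remark about exponential weights concrete, here is a minimal sketch of the textbook parity-to-mean-payoff encoding, in which a position with priority p in a game with n positions receives weight (-n)^p. It assumes the max-parity convention with the even player maximising the mean payoff; whether [25] and the benchmarks above use exactly this variant is an assumption, and the function name is hypothetical.

```python
def parity_to_mean_payoff_weights(priorities):
    """Map each position's priority p to the weight (-n)**p, n = #positions.

    Even priorities yield positive weights and odd priorities negative ones,
    and the absolute value grows exponentially with the priority.
    """
    n = len(priorities)
    return {v: (-n) ** p for v, p in priorities.items()}

weights = parity_to_mean_payoff_weights({"a": 0, "b": 1, "c": 2})
# weights == {"a": 1, "b": -3, "c": 9}
```

Already with a few hundred positions and a handful of priorities the weights exceed machine-word range, which is why solvers whose running time depends on W suffer on such instances.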

It is worth noting, though, that the translation from parity to MPGs gives
rise to weights that are exponentially distant from each other [25]. As a
consequence, the resulting benchmarks are not necessarily representative of
MPGs, being a very restricted subclass. Nonetheless, they provide evidence of
the applicability of the approach in practical scenarios.

Benchmark        Positions   Moves    SEPM (s)   QDPM (s)
Elevator 1             144     234      0.0661     0.0001
Elevator 2             564     950      8.80       0.0003
Elevator 3            2688    4544   4675.71       0.0017
Lang. Incl. 1          170    1094      3.18       0.0021
Lang. Incl. 2          304    1222     16.76       0.0019
Lang. Incl. 3          428     878     20.25       0.0033
Lang. Incl. 4          628    1538    135.51       0.0029
Lang. Incl. 5          509    2126    148.37       0.0034
Lang. Incl. 6          835    2914    834.90       0.0051
Lang. Incl. 7         1658    4544   2277.87       0.0100

Table 1: Concrete verification problems.


6 Concluding Remarks 


We proposed a novel solution algorithm for the decision problem of MPGs that 
integrates progress measures and quasi dominions. We argue that the integration 
of these two concepts may offer a significant speed-up in convergence to the solution,
at no additional computational cost. This is evidenced by the existence of a
family of games on which the combined approach can perform arbitrarily better
than a classic progress measure based solution. Experimental results also show
that the introduction of quasi dominions can often reduce solution times by up to
three orders of magnitude, suggesting that the approach may be very effective
in practical applications as well. We believe that the integration approach we 
devised is general enough to be applied to other types of games. In particular, 
the application of quasi dominions in conjunction with progress measure based 
approaches, such as those of and [1], may lead to practically efficient
quasi-polynomial algorithms for parity games and their quantitative extensions.


Abstract. Partial-order reduction (POR) is a well-established technique
to combat the problem of state-space explosion. We propose POR techniques
that are sound for parity games, a well-established formalism for
solving a variety of decision problems. As a consequence, we obtain the
first POR method that is sound for model checking for the full modal
μ-calculus. Our technique is applied to, and implemented for, the fixed
point logic called parameterised Boolean equation systems, which provides
a high-level representation of parity games. Experiments indicate
that substantial reductions can be achieved.


1 Introduction 


In the field of formal methods, model checking [2] is a popular technique to anal- 
yse the behaviour of concurrent processes. However, the arbitrary interleaving 
of these parallel processes can cause an exponential blowup, which is known as 
the state-space explosion problem. Several approaches have been identified to 
alleviate this issue, by reducing the state space on-the-fly, i.e., while generating 
it. Two established techniques are symmetry reduction and partial-order
reduction (POR) [8,26,30]. Whereas symmetry reduction can only be applied to
systems that contain several copies of a component, POR also applies to
heterogeneous systems. However, a major drawback of POR is that most variants
at best preserve only a fragment of a given logic, such as LTL or CTL* without
the next operator (LTL_x/CTL*_x) [7] or the weak modal μ-calculus [28].
Furthermore, the variants of POR that preserve a branching time logic impose 
significant restrictions on the reduction by only allowing the prioritisation of 
exactly one action at a time. This decreases the amount of reduction achieved. 

In this paper, we address these shortcomings by applying POR on parity
games. A parity game is an infinite-duration, two-player game played on a
directed graph with decorations on the nodes, in which the players even (denoted
$\Diamond$) and odd (denoted $\Box$) strive to win the nodes of the graph. An application of
parity games is encoding a model checking question: a combination of a model,
in the form of a transition system, and a formal property, formulated in the
modal μ-calculus [16]. In such games, every node v represents the combination
of a state s from the transition system and a (sub)formula $\varphi$. Under a typical
encoding, player $\Diamond$ wins in v if and only if $\varphi$ holds in s.

In the context of model checking, parity games suffer from the same state- 
space explosion that models do. Exploring the state space of a parity game under 
POR can be a very effective way to address this. Our contributions are as follows: 

— We propose conditions (Def. 4) that ensure that the reduction function used
to reduce the parity game is correct, i.e., preserves the winning player of the
parity game (Thm. 1).

— We identify improvements for the reduction by investigating the typical
structure of a parity game that encodes a model checking question.

— We illustrate how to apply our POR technique in the context of solving
parameterised Boolean equation systems (PBESs) [10], a fixed point logic
closely related to LFP, as a high-level representation of a parity game.

— We extend the ideas of with support for non-determinism and experiment 
with an implementation for solving PBESs. 

Our approach has two distinct benefits over traditional POR techniques that
operate on transition systems. First, it is the first work that enables the use of
partial-order reduction for model checking for the full modal μ-calculus. Second,
the conditions that we propose are strictly weaker than those necessary to
preserve the branching structure of a transition system used in other approaches to
POR for branching time logics [7,28], increasing the effectiveness of POR.

The experiments with our implementation for solving PBESs are quite promising.
Our results show that, in particular, those instances in which PBESs encode
model checking problems involving large state spaces benefit from the use
of partial-order reduction. In such cases, a significant size reduction is possible,
even when checking complex μ-calculus formulae, and the time penalty of
conducting the static analysis is more than made up for by the speed-up in the state
space exploration phase.


Related Work. There are several proposals for using partial-order reduction for
branching-time logics. Groote and Sellink [9] define several forms of confluence
reduction and prove which behavioural equivalences (and by extension, which
fragments of logics) are preserved. In confluence reduction, one tries to identify
internal transitions that can safely be prioritised, leading to a smaller state
space. Ramakrishna and Smolka propose a notion that coincides with strong
confluence from [9], preserving weak bisimilarity and the corresponding logic
weak modal μ-calculus.

Similar ideas are presented by Gerth et al. in [7]. Their approach is based on
the ample set method and preserves a relation they call visible bisimulation
and the associated logic CTL_x. To preserve the branching structure, they
introduce a singleton proviso which, contrary to our theory, can greatly impair
the amount of reduction that can be achieved (see our Example 3).

Valmari describes the stubborn sets method for LTL_x model checking. 
In general, stubborn sets allow for larger reductions than ample sets. While 
investigating the use of stubborn sets for parity games, we identified a subtle issue 
in one of the stubborn set conditions (called D1 in [33]). When applied to LTSs 


Partial-Order Reduction for Parity Games 309 


or KSs, this means that LTL_x is not necessarily preserved. Moreover, using
the condition in the setting of parity games may result in games with different
winners; for an example, see our technical report [24]. In [21], we further explore
the consequences of the faulty condition for stubborn-set based POR techniques
that can be found in the literature. We here resort to a strengthened version of
condition D1 that does not suffer from these issues.

Similar to our approach, Peled applies POR on the product of a transition
system and a Büchi automaton, which represents an LTL_x property. It is
important to note, though, that this original theory is not sound, as discussed
in [29]. Kan et al. improve on Peled's ideas and manage to preserve all of LTL.
To achieve this, they analyse the Büchi automaton that corresponds to the LTL
formula to identify which part is stutter insensitive. With this information, they
can reduce the state space in the appropriate places and preserve the validity of
the LTL formula under consideration.

The recent work by Bønneland et al. is close to ours in spirit, applying
stubborn-set based POR to reachability games. Such games can be used for
synthesis and for model checking reachability properties. Although the conditions
on reduction they propose seem unaffected by the aforementioned issue with D1,
their POR theory is nevertheless unsound, as we illustrate next.

In reachability games, player 1 tries to reach one of the goal states, while
player 2 tries to avoid them. Bønneland et al. propose a condition R that
guarantees that all goal states in the full game are also reachable in the reduced game.
However, the reverse is not guaranteed: paths that do not contain a goal state
are not necessarily preserved, essentially endowing player 1 with more power.
Consider the (solitaire) reachability game depicted on the
right, in which all edges belong to player 2 and the only goal
state is indicated with grey. Player 2 wins the non-reduced
game by avoiding the goal state via the edges labelled with
a and then b. However, {b} is a stubborn set, according to
the conditions of [3], in the initial state, and the dashed
transitions are thus eliminated in the reduced game. Hence, player 2 is forced to
move the token to the goal state and player 1 wins in the reduced game. In the
meantime, the authors of [3] have confirmed the issue and resolved it in [4].


Outline. We give a cursory overview of parity games in Section 2. In Section 3 we
introduce partial-order reduction for parity games, and we introduce a further
improvement in Section 3.3. Section 4 briefly introduces the PBES fixed point
logic, and in Section 5 we describe how to effectively implement parity-game
based POR for PBESs. We present the results of our experiments of using
parity-game based POR for PBESs in Section 6. We conclude in Section 7.


2 Preliminaries 


Parity games are infinite-duration, two-player games played on a directed graph.
The objective of the players, called even (denoted by $\Diamond$) and odd (denoted by
$\Box$), is to win nodes in the graph.


Definition 1. A parity game is a directed graph $G = (V, E, \Omega, \mathcal{P})$, where
— $V$ is a finite set of nodes, called the state space;
— $E \subseteq V \times V$ is a total edge relation;
— $\Omega : V \to \mathbb{N}$ is a function that assigns a priority to each node; and
— $\mathcal{P} : V \to \{\Diamond, \Box\}$ is a function that assigns a player to each node.


We write $s \to t$ whenever $(s,t) \in E$. The set of successors of a node $s$ is
denoted with $succ(s) = \{t \mid s \to t\}$. We use $\bigcirc$ to denote an arbitrary player and
$\overline{\bigcirc}$ to denote its opponent.
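
As an illustration of Definition 1, a parity game can be held in a small Python structure. This is a sketch only; the class name `ParityGame`, the string constants for the two players and the helper `opponent` are hypothetical and not part of the paper.

```python
from dataclasses import dataclass

EVEN, ODD = "even", "odd"  # stand-ins for the two player symbols

def opponent(player):
    """Return the other player."""
    return ODD if player == EVEN else EVEN

@dataclass
class ParityGame:
    nodes: set       # V
    edges: set       # E, a set of (s, t) pairs
    priority: dict   # Omega: node -> natural number
    owner: dict      # P: node -> EVEN or ODD

    def succ(self, s):
        """succ(s) = { t | s -> t }."""
        return {t for (u, t) in self.edges if u == s}

    def is_total(self):
        """The edge relation is total if every node has a successor."""
        return all(self.succ(s) for s in self.nodes)

game = ParityGame(
    nodes={"s1", "s2"},
    edges={("s1", "s2"), ("s2", "s1"), ("s2", "s2")},
    priority={"s1": 1, "s2": 0},
    owner={"s1": EVEN, "s2": ODD},
)
```

Storing edges as a set of pairs mirrors the relation E directly; for large games an adjacency map would be the more practical choice.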

A parity game is played as follows: initially, a token is placed on some node
of the graph. The owner of the node can decide where to move the token; the
token may be moved along one of the outgoing edges. This process continues ad
infinitum, yielding an infinite path of nodes that the token moves through. Such
a path is called a play. A play $\pi$ is won by player $\Diamond$ if the minimal priority that
occurs infinitely often along $\pi$ is even. Otherwise, it is won by player $\Box$.

To reason about moves that a player may want to take, we use the concept
of strategies. A strategy $\sigma_{\bigcirc} : V^{+} \to V$ for player $\bigcirc$ is a partial
function that determines where $\bigcirc$ moves the token next, after the token has passed
through a finite sequence of nodes. More formally, for all sequences $s_1 \ldots s_n$
such that $\mathcal{P}(s_n) = \bigcirc$, it holds that $\sigma_{\bigcirc}(s_1 \ldots s_n) \in succ(s_n)$. If $s_n$ belongs to
$\overline{\bigcirc}$, $\sigma_{\bigcirc}(s_1 \ldots s_n)$ is undefined. A play $s_1, s_2, \ldots$ is consistent with a strategy $\sigma$ if
and only if $\sigma(s_1 \ldots s_i) = s_{i+1}$ for all $i$ such that $\sigma(s_1 \ldots s_i)$ is defined. A player
$\bigcirc$ wins in a node $s$ if and only if there is a strategy $\sigma_{\bigcirc}$ such that all plays that
start in $s$ and that are consistent with $\sigma_{\bigcirc}$ are won by player $\bigcirc$.
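
The winning condition can be made concrete for plays that end in a loop. The sketch below restricts both players to positional (memoryless) strategies, which is a simplification of the history-dependent strategies defined above; all names and the example game are hypothetical.

```python
def play_from(owner, strategies, start):
    """Follow positional strategies until a node repeats.

    strategies maps each player to a dict node -> chosen successor.
    Returns (prefix, cycle); the resulting play is prefix followed by
    the cycle repeated forever.
    """
    position = {}
    path = []
    node = start
    while node not in position:
        position[node] = len(path)
        path.append(node)
        node = strategies[owner[node]][node]
    k = position[node]
    return path[:k], path[k:]

def winner_of_cycle(priority, cycle):
    """Min-parity condition: only priorities on the cycle occur infinitely often."""
    return "even" if min(priority[s] for s in cycle) % 2 == 0 else "odd"

# A made-up four-node game.
owner = {"s1": "even", "s2": "even", "s3": "odd", "s4": "even"}
priority = {"s1": 1, "s2": 0, "s3": 1, "s4": 3}
strategies = {
    "even": {"s1": "s2", "s2": "s1", "s4": "s3"},
    "odd": {"s3": "s4"},
}
prefix, cycle = play_from(owner, strategies, "s1")
```

Here the play from `s1` loops through `s1` and `s2`, whose minimal priority is 0, so it is won by the even player.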


Example 1. Consider the parity game on the right. Here, priorities are inscribed
in the nodes and the nodes are shaped according to their owner ($\Diamond$ or $\Box$).
Let $\pi$ be an arbitrary, possibly empty, sequence of nodes. In this game, the
strategy $\sigma_{\Diamond}$, partially defined as $\sigma_{\Diamond}(\pi s_1) = s_2$ and $\sigma_{\Diamond}(\pi s_2) = s_1$,
is winning for $\Diamond$ in $s_1$ and $s_2$. After all, the minimal priority that occurs
infinitely often along $(s_1 s_2)^{\omega}$ is 0, which is even. Player $\Box$ can win node $s_3$
with the strategy $\sigma_{\Box}(\pi s_3) = s_4$. Note that player $\Diamond$ is always forced to move
the token from node $s_4$ to $s_3$.


3  Partial-Order Reduction 


In model checking, arbitrary interleaving of concurrent processes can lead to 
a combinatorial explosion of the state space. By extension, parity games that 
encode model checking problems for concurrent processes suffer from the same 
phenomenon. Partial-order reduction (POR) techniques help combat the blowup. 
Several variants of POR exist, such as ample sets [26], persistent sets and
stubborn sets. The current work is based on Valmari's stubborn set theory
as it can easily deal with nondeterminism [32].


3.1 Weak Stubborn Sets


Partial-order reduction relies on edge labels, here referred to as events and typ- 
ically denoted with the letter j, to categorise the set of edges in a graph and 
determine independence of edges. In a typical application of POR, such events 
and edge labellings are deduced from a high-level syntactic description of the 
graph structure (see also Section p. A reduction function subsequently uses 
these events when producing an equivalent reduced graph structure from the 
same high-level description. For now, we tacitly assume the existence of a set of 
events and edge labellings for parity games and refer to the resulting structures 
as labelled parity games. 


Definition 2. A labelled parity game is a triple $L = (G, S, \ell)$, where $G =
(V, E, \Omega, \mathcal{P})$ is a parity game, $S$ is a non-empty set of events and $\ell : S \to 2^{E}$ is
an edge labelling.


For the remainder of this section, we fix an arbitrary labelled parity game $L =
(G, S, \ell)$. We write $s \xrightarrow{j} t$ whenever $s \to t$ and $(s,t) \in \ell(j)$. The same notation
extends to longer executions $s \xrightarrow{j_1 \ldots j_n} t$. We say an event $j$ is enabled in a node
$s$, notation $s \xrightarrow{j}$, if and only if there is a transition $s \xrightarrow{j} t$ for some $t$. The set of all
enabled events in a node $s$ is denoted with $enabled(s)$. An event $j$ is invisible if
and only if $s \xrightarrow{j} t$ implies $\mathcal{P}(s) = \mathcal{P}(t)$ and $\Omega(s) = \Omega(t)$. Otherwise, $j$ is visible.
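
The notions of enabled and invisible events translate directly into code. In this sketch the labelling is represented as a dict from event names to sets of edges; that representation and the function names are assumptions made for illustration.

```python
def enabled(label, s):
    """Events j for which some edge (s, t) is in label[j]."""
    return {j for j, edges in label.items() if any(u == s for (u, _) in edges)}

def is_invisible(label, owner, priority, j):
    """j is invisible iff every j-labelled edge keeps owner and priority."""
    return all(
        owner[u] == owner[t] and priority[u] == priority[t]
        for (u, t) in label[j]
    )

# A made-up labelled game fragment.
label = {
    "a": {("s0", "s1")},
    "b": {("s0", "s2"), ("s1", "s3")},
}
owner = {"s0": "even", "s1": "even", "s2": "odd", "s3": "even"}
priority = {"s0": 0, "s1": 0, "s2": 0, "s3": 0}
```

In this fragment `a` is invisible, whereas `b` is visible because the edge from `s0` to `s2` changes the owner.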

A reduction function indicates which edges are to be explored in each node,
based on the events associated to the edges. Given some initial node $\hat{s}$, such a
function induces a unique reduced labelled parity game as follows.


Definition 3. Let $\hat{s} \in V$ be a node and $r : V \to 2^{S}$ a reduction function. The
reduced labelled parity game induced by $r$ and starting from $\hat{s}$ is defined as
$L_r = (G_r, S, \ell_r)$, where $\ell_r(j) = \ell(j) \cap E_r$ and $G_r = (V_r, E_r, \Omega, \mathcal{P})$ is such that:
— $E_r = \{(s,t) \in E \mid \exists j \in r(s) : (s,t) \in \ell(j)\}$ is the transition relation under $r$;
— $V_r = \{s \mid \hat{s}\, E_r^{*}\, s\}$ is the set of nodes reachable with $E_r$, where $E_r^{*}$ is the
reflexive transitive closure of $E_r$.
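
Definition 3 amounts to a reachability search that only follows edges whose event was selected by r. The following sketch assumes the labelling is a dict from events to edge sets and that r is an ordinary Python function; both are illustrative choices, as is the function name.

```python
from collections import deque

def reduce_game(edges, label, r, start):
    """Compute (V_r, E_r) for reduction function r, starting from `start`."""
    reduced_nodes = {start}
    reduced_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for j in r(s):
            for (u, t) in label.get(j, ()):
                if u == s and (u, t) in edges:
                    reduced_edges.add((u, t))
                    if t not in reduced_nodes:
                        reduced_nodes.add(t)
                        queue.append(t)
    return reduced_nodes, reduced_edges

# A diamond of two independent events a and b, followed by a c-loop.
label = {
    "a": {("s0", "s1"), ("s2", "s3")},
    "b": {("s0", "s2"), ("s1", "s3")},
    "c": {("s3", "s3")},
}
edges = set().union(*label.values())
r = lambda s: {"a"} if s == "s0" else {"a", "b", "c"}
nodes_r, edges_r = reduce_game(edges, label, r, "s0")
```

Selecting only `a` in `s0` prunes the node `s2`: the reduced game visits `s0`, `s1` and `s3`.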


Note that a reduced labelled parity game is only well-defined when $r(s) \cap
enabled(s) \neq \emptyset$ for every node $s \in V_r$; if this property does not hold, $E_r$ is
not total. Even if totality of $E_r$ is guaranteed, the same node $s$ may be won by
different players in $L$ and $L_r$ if no restrictions are imposed on $r$. The following
conditions on $r$, as we will show, are sufficient to ensure both. Below, we
say an event $j$ is a key event in $s$ iff for all executions $s \xrightarrow{j_1 \ldots j_n} s'$ such that
$j_1 \notin r(s), \ldots, j_n \notin r(s)$, we have $s' \xrightarrow{j}$. Key events are typically denoted $j_{key}$.
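
On a finite game, the key-event property can be checked by exploring everything reachable through events outside r(s). This is a sketch under the same assumed dict representation of the labelling; the function name is hypothetical.

```python
def is_key_event(label, r_s, s, j):
    """j is key in s iff j is enabled in every node reachable from s
    (including s itself, the case n = 0) using only events not in r_s."""
    outside = [e for e in label if e not in r_s]
    seen = {s}
    stack = [s]
    while stack:
        u = stack.pop()
        if not any(a == u for (a, _) in label[j]):
            return False  # found a node where j is disabled
        for e in outside:
            for (a, t) in label[e]:
                if a == u and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return True

label = {
    "a": {("s0", "s1"), ("s2", "s3")},
    "b": {("s0", "s2"), ("s1", "s3")},
    "c": {("s3", "s3")},
}
```

With `r_s = {"a"}` in `s0`, the only execution outside `r_s` leads to `s2`, where `a` is still enabled, so `a` is a key event there.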


Definition 4. We say that a reduction function $r : V \to 2^{S}$ is a weak stubborn
set iff for all nodes $s \in V$, the following conditions hold (see footnote 1):

D1 For all $j \in r(s)$ and $j_1 \notin r(s), \ldots, j_n \notin r(s)$, if $s \xrightarrow{j_1} s_1 \xrightarrow{j_2} \cdots \xrightarrow{j_n} s_n \xrightarrow{j} s'_n$,
then there are nodes $s', s'_1, \ldots, s'_{n-1}$ such that $s \xrightarrow{j} s' \xrightarrow{j_1} s'_1 \xrightarrow{j_2} \cdots \xrightarrow{j_n} s'_n$.
Furthermore, if $j$ is invisible, then $s_i \xrightarrow{j} s'_i$ for every $1 \leq i < n$.

D2w $r(s)$ contains a key event in $s$.

V If $r(s)$ contains an enabled visible event, then it contains all visible events.

I If an invisible event is enabled, then $r(s)$ contains an invisible key event.

L For every visible event $j$, every cycle in the reduced game contains a node
$s'$ such that $j \in r(s')$.

1 As noted before, the condition D1 that we propose is stronger than the version in
literature [30,33] since that one suffers from the inconsistent labelling problem [21],
which also manifests itself in the parity game setting; see our technical report [24].


Below, we also use (weak) stubborn set to refer to the set of events $r(s)$ in some
node $s$. First, note that every key event, which we typically denote by $j_{key}$, in
a node $s$ is enabled in $s$, by taking $n = 0$ in D2w; this guarantees totality of
$E_r$. Condition D1 ensures that whenever an enabled event is selected for the
stubborn set, it does not disable executions not in $r(s)$. A stubborn set can
never be empty, due to D2w. In a traditional setting where POR is applied on
a transition system, the combination of D1 and D2w is sufficient to preserve
deadlocks. Condition V enforces that either all visible events are selected for the
stubborn set, or none are. Condition L prevents the so-called action-ignoring
problem, where a certain event is never selected for the stubborn set and ignored
indefinitely. Combined, I and L preserve plays with invisible events only.
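
Conditions V and L are the easiest to check mechanically on an explicit, finite reduced game. The sketch below is not how a POR implementation would establish them (that is normally done with static analysis and a cycle proviso during exploration); it only restates the two conditions as executable checks, with hypothetical names throughout.

```python
def satisfies_V(r_s, enabled_s, visible):
    """V: an enabled visible event in r(s) forces all visible events into r(s)."""
    if any(j in visible for j in r_s & enabled_s):
        return visible <= r_s
    return True

def has_cycle(nodes, edges):
    """Kahn-style test: the subgraph induced by `nodes` has a cycle iff
    not every node can be removed in topological order."""
    sub = [(u, t) for (u, t) in edges if u in nodes and t in nodes]
    indegree = {s: 0 for s in nodes}
    for _, t in sub:
        indegree[t] += 1
    ready = [s for s in nodes if indegree[s] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for (a, t) in sub:
            if a == u:
                indegree[t] -= 1
                if indegree[t] == 0:
                    ready.append(t)
    return removed < len(nodes)

def satisfies_L(reduced_nodes, reduced_edges, r, visible):
    """L: for each visible j, no cycle of the reduced game avoids all
    nodes s' with j in r(s')."""
    for j in visible:
        ignoring = {s for s in reduced_nodes if j not in r(s)}
        if has_cycle(ignoring, reduced_edges):
            return False
    return True
```

For L, deleting every node that selects j and testing the remainder for acyclicity is equivalent to requiring that each cycle passes through such a node.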

We use the example below to further illustrate the purpose of—and need for— 
conditions V, I and L. In particular, the example illustrates that the winning 
player in the original game and the reduced game might be different if one of 
these conditions is not satisfied. 


Example 2. See the three parity games of Figure 1. From left to right, these
games show a reduced game under a reduction function satisfying D1 and D2w
but not V, I or L, respectively. In each case, we start exploration from the node
called $\hat{s}$, using the reduction function to follow the solid edges; consequently, the
winning strategy $\sigma_{\Diamond}$ for player $\Diamond$ in the original game is lost.


Fig. 1. Three games that show the winner is not necessarily preserved if we drop one
of the conditions V, I or L, respectively. The dashed nodes and edges are present in
the original game, but not in the reduced game. The edges taken from $\hat{s}$ by the winning
strategy for player $\Diamond$ in the original game are indicated below each game.

Note that the games in Figure 1 are from a subclass of parity games called
weak solitaire, illustrating the need for the identified conditions even in restricted
settings. A game is weak if the priorities along all its paths are non-decreasing,
i.e., if $s \to t$ then $\Omega(s) \leq \Omega(t)$. A game is solitaire if only one player can make
non-trivial choices. Weak solitaire games can encode the model checking of safety
properties; solitaire games can capture logics such as LTL and ACTL*.
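
Both subclasses are easy to recognise on an explicit game. In this sketch a "non-trivial choice" is read as a node with more than one successor; that reading and the function names are assumptions.

```python
def is_weak(edges, priority):
    """Weak: priorities never decrease along an edge."""
    return all(priority[s] <= priority[t] for (s, t) in edges)

def is_solitaire(nodes, edges, owner):
    """Solitaire: at most one player owns a node with several successors."""
    choosers = {
        owner[s]
        for s in nodes
        if len({t for (u, t) in edges if u == s}) > 1
    }
    return len(choosers) <= 1

nodes = {"s0", "s1", "s2"}
edges = {("s0", "s1"), ("s0", "s2"), ("s1", "s1"), ("s2", "s2")}
priority = {"s0": 0, "s1": 1, "s2": 2}
owner = {"s0": "even", "s1": "odd", "s2": "odd"}
```

The example game is weak solitaire: priorities only grow along edges, and the odd player never has more than one move to choose from.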

Before we argue for the correctness of our POR approach in the next section, 
we finish with a small example that illustrates how our approach improves over 
existing methods for branching time logics. 


Example 3. The conditions C1-C3 of Gerth et al. [7] preserve LTL_x and are
similar in spirit to our conditions. However, to preserve the branching structure,
needed for preservation of CTL_x, the following singleton proviso is introduced:
C4 Either $enabled(s) \subseteq r(s)$ or $|r(s)| = 1$.

This extra condition can severely impact the amount of reduction achieved:
consider the following two processes, where n > 1 is some large natural number.

The cross product of these processes contains $(n + 1)^2$ states. In the initial state,
neither (a1,a1) nor {b1, 51 is a valid stubborn set, due to C4. However, the la- 
belled parity game constructed using these processes and the p-calculus formula 
v X.([-]X ^ uY-((—)Y V (as) true)), has a very similar shape that can be reduced 
by prioritising transitions that correspond to b; or b; for some 1 € 4 € n. Note 
that this formula cannot be represented in LTL; condition C4 is therefore essential
for the correctness. While several optimisations for CTL_x model checking
under POR are proposed in [19], unlike our approach, those optimisations only
work for certain classes of CTL_x formulas and not in general.


3.2 Correctness 


Condition D2w suffices, as we already argued, to preserve totality of the
transition relation of the reduced labelled parity game. Hence, we are left to argue
that the reduced game preserves and reflects the winner of the nodes of the
original game; this is formally claimed in Theorem 1. We do so by constructing
a strategy in the reduced game that mimics the winning strategy in the original
game. The plays that are consistent with these two strategies are then shown to
be stutter equivalent, which suffices to preserve the winner.

Fix a labelled parity game $L = (G, S, \ell)$, a node $\hat{s}$, a weak stubborn set $r$
and the reduced labelled parity game $L_r = (G_r, S, \ell_r)$ induced by $r$ and $\hat{s}$. We
assume $r$ and $\hat{s}$ are such that $G_r$ has a finite state space. Below, $\omega$ is the set
containing all natural numbers and the smallest infinite ordinal number.


Definition 5. Let $\pi = s_0 s_1 s_2 \ldots$ and $\pi' = t_0 t_1 t_2 \ldots$ be two paths in $G$. We
say $\pi$ and $\pi'$ are stutter equivalent, notation $\pi \triangleq \pi'$, if and only if one of the
following conditions holds:
— $\pi$ and $\pi'$ are both finite and there exists a non-decreasing partial function
$f : \omega \to \omega$, with $f(0) = 0$ and $f(|\pi| - 1) = |\pi'| - 1$, such that for all $0 \leq i < |\pi|$
and $i' \in [f(i), f(i+1))$, it holds that $\mathcal{P}(s_i) = \mathcal{P}(t_{i'})$ and $\Omega(s_i) = \Omega(t_{i'})$.
— $\pi$ and $\pi'$ are both infinite and there exists an unbounded, non-decreasing total
function $f : \omega \to \omega$, with $f(0) = 0$, such that for all $i$ and $i' \in [f(i), f(i+1))$,
it holds that $\mathcal{P}(s_i) = \mathcal{P}(t_{i'})$ and $\Omega(s_i) = \Omega(t_{i'})$.
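
For intuition, the familiar block-wise reading of stutter equivalence on finite paths can be tested by collapsing repeated (owner, priority) pairs. Note the assumption: this sketch implements that common characterisation, which is not literally the function-based formulation of Definition 5, and the names are hypothetical.

```python
from itertools import groupby

def collapse(path, owner, priority):
    """Sequence of (owner, priority) pairs with consecutive repeats merged."""
    return [key for key, _ in groupby((owner[s], priority[s]) for s in path)]

def stutter_equivalent(path1, path2, owner, priority):
    """Finite paths are treated as stutter equivalent when their
    collapsed (owner, priority) sequences coincide."""
    return collapse(path1, owner, priority) == collapse(path2, owner, priority)

owner = {"s0": "even", "s1": "even", "t0": "even", "s2": "odd"}
priority = {"s0": 0, "s1": 0, "t0": 0, "s2": 1}
```

Since only owners and priorities are compared, paths through different nodes can be equivalent, for example `s0 s1 s2` and `t0 s2` in the data above.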


Lemma 1. All infinite stutter equivalent paths have the same winner. 


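In the lemmata below, we write $\xrightarrow{j}_r$ to stress which transition must occur in $G_r$.


Lemma 2. Suppose $s_0 \xrightarrow{j_1} \cdots \xrightarrow{j_n} s_n \xrightarrow{j} s'_n$ for $j_1 \notin r(s_0), \ldots, j_n \notin r(s_0)$
and $j \in r(s_0)$. Then for some $s'_0, \ldots, s'_{n-1}$, both $s_0 \xrightarrow{j}_r s'_0 \xrightarrow{j_1} \cdots \xrightarrow{j_n} s'_n$ and
$s_0 \ldots s_n s'_n \triangleq s_0 s'_0 \ldots s'_n$.

Lemma 3. Suppose $s_0 \xrightarrow{j_1} s_1 \xrightarrow{j_2} \cdots$ such that $j_i \notin r(s_0)$ for every $j_i$ occurring
on this execution. Then, the following holds:

— If the execution ends in $s_n$, there exist a key event $j_{key}$ and nodes $s'_0, \ldots, s'_n$
such that $s_n \xrightarrow{j_{key}} s'_n$ and $s_0 \xrightarrow{j_{key}}_r s'_0 \xrightarrow{j_1} \cdots \xrightarrow{j_n} s'_n$ and
$s_0 \ldots s_n \triangleq s_0 s'_0 \ldots s'_n$.

— If the execution is infinite, there exists another execution $s_0 \xrightarrow{j_{key}}_r s'_0 \xrightarrow{j_1}
s'_1 \xrightarrow{j_2} \cdots$ for some key event $j_{key}$ and $s_0 s_1 \ldots \triangleq s_0 s'_0 s'_1 \ldots$.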

We remark that Lemma 3 also holds for reduced labelled parity games that
have an infinite state space, but where all the events are finitely branching. The
proof of correctness, viz., Theorem 1, uses the alternative executions described
by Lemmata 2 and 3. For full details, we refer to [24]; we here limit ourselves to
sketching the intuition behind the application of these lemmata.


Example 4. The structure of Figure 2, in which parallel edges have the same
label, visualises part of a game in which the solid edges labelled $j_1 j_2 j_3$ are part
of a winning play for player $\bigcirc$. This play is mimicked by the path that follows the
edges $j_{key}\, j_2\, j_1\, j'_{key}\, j_3$, drawn with dashes. The new play reorders the events $j_1$, $j_2$
and $j_3$ according to the construction of Lemma 2 and introduces the key events
$j_{key}$ and $j'_{key}$ according to the construction of Lemma 3.


Fig. 2. Example of how $j_1, j_2, j_3$ is mimicked by introducing $j_{key}$ and $j'_{key}$ and moving
$j_2$ to the front (dashed trace). Transitions that are drawn in parallel have the same
label.

The following theorem shows that partial-order reduction preserves the winning
player in all nodes of the reduced game. Its proof is inspired by and
Lemma 8.21], and uses the aforementioned lemmata.


Theorem 1. If $G_r$ has a finite state space then it holds that for every node $s$
in $G_r$, the winner of $s$ in $G_r$ is equal to the winner of $s$ in $G$.


3.3 Optimising D2w 


The theory we have introduced identifies and exploits rectangular structures in 
the parity game. This is especially apparent in condition D1. However, par- 
ity games obtained from model checking problems also often contain triangular 
structures, due to the (sometimes implicit) nesting of conjunctions and disjunc- 
tions, as the following example demonstrates. 


Example 5. Consider the process $(a \parallel b) \cdot c$, in which actions a and b are executed
in (interleaved) parallel, and action c is executed upon termination of both a and
b. The μ-calculus property $\mu X.([a]X \wedge [b]X \wedge \langle - \rangle \mathit{true})$, also expressible in LTL,
expresses that the action c must unavoidably be done within a finite number of
steps; clearly this property holds true of the process. Below, the LTS is depicted
on the left and a possible parity game encoding of our liveness property on this
state space is depicted on the right. The edges in the labelled parity game that
originate from the subformula $\langle - \rangle \mathit{true}$ are labelled with d.


Whereas the state space of the process can be reduced by prioritising a or b, the
labelled parity game cannot be reduced due to the presence of a d-labelled edge
in every node. For example, if s is the top-left node in the labelled parity game,
then $r(s) = \{a, d\}$ violates condition D1, since the execution $s \xrightarrow{b\,d}$ exists, but
$s \xrightarrow{d\,b}$ does not.


In order to deal with games that contain triangular structures, we propose a
condition that is weaker than D2w.

D2t There is an event $j \in r(s)$ such that for all $j_1 \notin r(s), \ldots, j_n \notin r(s)$, if
$s \xrightarrow{j_1} s_1 \xrightarrow{j_2} \cdots \xrightarrow{j_n} s_n$, then either $s_n \xrightarrow{j}$ or there are nodes $s', s'_1, \ldots, s'_n$
such that $s \xrightarrow{j} s' \xrightarrow{j_1} s'_1 \xrightarrow{j_2} \cdots \xrightarrow{j_n} s'_n$ and for all $i$, $s_i = s'_i$ or $s_i \xrightarrow{j} s'_i$.


Theorem 1 holds even for reduction functions satisfying the weak stubborn set
conditions in which condition D2t is used instead of condition D2w. The proof
thereof resorts to a modified construction of a mimicking winning strategy that
is based on Lemma 4 described below, instead of Lemma 3.


Lemma 4. Let $r$ be a reduction function satisfying conditions D1, D2t, V, I
and L. Suppose $s_0 \xrightarrow{j_1} s_1 \xrightarrow{j_2} \cdots$ such that $j_i \notin r(s_0)$ for every $j_i$ occurring on
this execution. Then, the following holds:
— If the execution ends in $s_n$, there exist a key event $j_{key}$ and nodes $s'_0, \ldots, s'_n$
such that:
  • $s_n \xrightarrow{j_{key}} s'_n$ or $s_n = s'_n$; and
  • $s_0 \xrightarrow{j_{key}}_r s'_0 \xrightarrow{j_1} \cdots \xrightarrow{j_n} s'_n$ and $s_0 \ldots s_n \triangleq s_0 s'_0 \ldots s'_n$.
— If the execution is infinite, there exists another execution $s_0 \xrightarrow{j_{key}}_r s'_0 \xrightarrow{j_1}
s'_1 \xrightarrow{j_2} \cdots$ and $s_0 s_1 \ldots \triangleq s_0 s'_0 s'_1 \ldots$.


We remark that the concepts of triangular and rectangular structures bear sim- 
ilarities to the concept of weak confluence from [9]. 


4 Parameterised Boolean Equation Systems 


Parity games are used, among others, to solve parameterised Boolean equation
systems (PBESs) [10], which, in turn, are used to answer, e.g., first-order modal
μ-calculus model checking problems [5]. In the remainder of this paper, we show
how to apply POR in the context of solving a PBES (and, hence, the encoded 
decision problem). We first introduce PBESs and show how they induce labelled 
parity games. 

Parameterised Boolean equation systems are sequences of fixed point equa- 
tions over predicate formulae, i.e., first-order logic formulae with second order 
variables. A PBES is given in the context of an abstract data type, which is used 
to reason about data. Non-empty data sorts of the abstract data type are typically
denoted with the letters $D$ and $E$. The corresponding semantic domains
are $\mathbb{D}$ and $\mathbb{E}$. We assume that sorts $B$ and $N$ represent the Booleans and the
natural numbers respectively, and have $\mathbb{B}$ and $\mathbb{N}$ as semantic counterpart. The
set of data variables is $\mathcal{V}$, and its elements are usually denoted with $d$ and $e$. To
interpret expressions with variables, we use a data environment $\delta$, which maps
every variable in $\mathcal{V}$ to an element of the corresponding sort. The semantics of an
expression $f$ in the context of such an environment is denoted $[\![f]\!]\delta$. For instance,
$[\![x < 2 + y]\!]\delta$ holds true iff $\delta(x) < 2 + \delta(y)$. To update an environment, we use
the notation $\delta[v/d]$, which is defined as $\delta[v/d](d) = v$ and $\delta[v/d](d') = \delta(d')$ for
all variables $d' \neq d$.
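The environment update can be made concrete with a small sketch. It assumes a data environment is modelled as a Python dictionary from variable names to values; the names `update` and `eval_less_than` are illustrative and do not come from the paper.

```python
from typing import Dict, Union

Value = Union[bool, int]
Env = Dict[str, Value]


def update(delta: Env, d: str, v: Value) -> Env:
    """Return delta[v/d]: maps d to v and agrees with delta on all other variables."""
    new_env = dict(delta)  # copy, so the original environment is left unchanged
    new_env[d] = v
    return new_env


def eval_less_than(delta: Env) -> bool:
    """Interpret the example expression x < 2 + y under the environment delta."""
    return delta["x"] < 2 + delta["y"]


delta = {"x": 3, "y": 0}
updated = update(delta, "y", 5)
```

Under `delta` the expression is false (3 < 2 + 0 fails), while under `updated` it holds, and `delta` itself is untouched by the update.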

For lack of space, we only consider PBESs in standard recursive form (SRF),
a normal form in which each right-hand side of an equation is a guarded formula
instead of an arbitrary (monotone) predicate formula. We remark that a PBES
can be rewritten to SRF in linear time, while the number of equations grows
linearly in the worst case […, Proposition 2].

Let $\mathcal{X}$ be a countable set of predicate variables. In the exposition that follows
we assume for the sake of simplicity (but without loss of generality) that all
predicate variables $X \in \mathcal{X}$ are of type $D$. We permit ourselves the use of
non-uniformly typed predicate variables in our example.


Definition 6. A guarded formula $\varphi$ is a disjunctive or conjunctive formula of
the form:

$$\bigvee_{j \in J} \exists e_j{:}E_j.\, f_j \land X_j(g_j) \qquad\text{or}\qquad \bigwedge_{j \in J} \forall e_j{:}E_j.\, f_j \Rightarrow X_j(g_j)$$

where $J$ is an index set, each $f_j$ is a Boolean expression, referred to as guard,
every $e_j$ is a (bound) variable of sort $E_j$, each $g_j$ is an expression of type $D$ and
each $X_j$ is a predicate variable of type $D$. A guarded formula $\varphi$ is said to be total
if for each data environment $\delta$, there is a $j \in J$ and $v \in \mathbb{E}_j$ such that $[\![f_j]\!]\delta[v/e_j]$
holds true.


The denotational semantics of a guarded formula is given in the context of a
data environment $\delta$ for interpreting data expressions and a predicate environment
$\eta : \mathcal{X} \to 2^{\mathbb{D}}$, yielding an interpretation of $X_j(g_j)$ as the truth value $[\![g_j]\!]\delta \in \eta(X_j)$.
Given a predicate environment and a data environment, a guarded formula induces
a monotone operator on the complete lattice $(2^{\mathbb{D}}, \subseteq)$. By Tarski's theorem,
least ($\mu$) and greatest ($\nu$) fixed points of such operators are guaranteed to exist.
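For a finite domain, both fixed points can be reached by plain iteration from the bottom or top of the lattice. The sketch below assumes a five-element stand-in `DOMAIN` for the data domain and two hand-written monotone operators (`shift`, `grow`); none of these names come from the paper, and the iteration is only guaranteed to terminate because the lattice is finite.

```python
from typing import Callable, FrozenSet

Subset = FrozenSet[int]
Operator = Callable[[Subset], Subset]

# Assumption: a small finite stand-in for the (possibly infinite) data domain.
DOMAIN: Subset = frozenset(range(5))


def fixed_point(op: Operator, start: Subset) -> Subset:
    """Iterate a monotone operator from `start` until the value stabilises."""
    current = start
    while True:
        following = op(current)
        if following == current:
            return current
        current = following


def least_fixed_point(op: Operator) -> Subset:
    return fixed_point(op, frozenset())  # start from the bottom element


def greatest_fixed_point(op: Operator) -> Subset:
    return fixed_point(op, DOMAIN)  # start from the top element


def shift(subset: Subset) -> Subset:
    """n is in the result iff (n + 1) mod 5 is in subset."""
    return frozenset(n for n in DOMAIN if (n + 1) % 5 in subset)


def grow(subset: Subset) -> Subset:
    """0 is always in the result; n + 1 is added for each n in subset below 3."""
    return frozenset({0}) | frozenset(n + 1 for n in subset if n + 1 < 4)
```

For `shift` the two fixed points differ (the empty set versus the whole domain), which mirrors how a μ-equation and a ν-equation with the same right-hand side can have different solutions. For `grow` they coincide.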


Definition 7. A parameterised Boolean equation in SRF is an equation that
has the shape $(\mu X(d{:}D) = \varphi(d))$ or $(\nu X(d{:}D) = \varphi(d))$, where $\varphi(d)$ is a total
guarded formula in which $d$ is the only free data variable. A parameterised
Boolean equation system in SRF is a sequence of parameterised Boolean equations
in SRF, in which no two equations have the same left-hand side variable.


Henceforward, let $\mathcal{E} = (\sigma_1 X_1(d{:}D) = \varphi_1(d)) \dots (\sigma_n X_n(d{:}D) = \varphi_n(d))$ be a fixed,
arbitrary PBES in SRF, where $\sigma_i \in \{\mu, \nu\}$. The set of bound predicate variables of
$\mathcal{E}$, denoted $\mathrm{bnd}(\mathcal{E})$, is the set $\{X_1, \dots, X_n\}$. If the predicate variables occurring
in the guarded formulae $\varphi_i(d)$ of $\mathcal{E}$ are taken from $\mathrm{bnd}(\mathcal{E})$, then $\mathcal{E}$ is said to
be closed; we only consider closed PBESs. Every bound predicate variable is
assigned a rank, where $\mathrm{rank}_{\mathcal{E}}(X_i)$ is the number of alternations in the sequence
of fixpoint symbols $\nu\sigma_1\sigma_2 \dots \sigma_i$. Observe that $\mathrm{rank}_{\mathcal{E}}(X_i)$ is even iff $\sigma_i = \nu$. We
use the function $\mathrm{op}_{\mathcal{E}} : \mathrm{bnd}(\mathcal{E}) \to \{\lor, \land\}$ to indicate for each predicate variable in
$\mathcal{E}$ whether the associated equation is disjunctive or conjunctive. As a notational
convenience, we write $J_i$ to refer to the index set of the guarded formula $\varphi_i(d)$,
and we assume that the index sets are disjoint for different equations.
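The rank definition amounts to counting sign changes in the fixpoint sequence once a leading ν has been prepended. A minimal sketch, assuming the fixpoint symbols are given as the strings `"mu"` and `"nu"` (an encoding chosen here for illustration):

```python
from typing import List


def ranks(sigmas: List[str]) -> List[int]:
    """Rank of each equation: alternations in the sequence nu, sigma_1, ..., sigma_i."""
    result = []
    previous = "nu"  # the sequence is prefixed with nu by definition
    alternations = 0
    for sigma in sigmas:
        if sigma != previous:
            alternations += 1
        result.append(alternations)
        previous = sigma
    return result


example = ranks(["mu", "nu", "nu", "mu"])
```

In `example` the ranks are 1, 2, 2, 3, so each rank is even exactly for the ν-equations, as observed in the text.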

The standard denotational fixed point semantics of a closed PBES associates
a subset of $\mathbb{D}$ to each bound predicate variable (i.e., their meaning is independent
of the predicate environment used to interpret guarded formulae). For details of
the standard denotational fixed point semantics of a PBES we refer to [10]. We
forego the denotational semantics and instead focus on the (provably equivalent,
see e.g. [23,6]) game semantics of a PBES in SRF.


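Definition 8. The solution to $\mathcal{E}$ is a mapping $[\![\mathcal{E}]\!] : \mathrm{bnd}(\mathcal{E}) \to 2^{\mathbb{D}}$, defined as
$[\![\mathcal{E}]\!](X_i) = \{v \in \mathbb{D} \mid (X_i, v) \text{ is won by } \Diamond \text{ in } G_{\mathcal{E}}\}$, where $X_i \in \mathrm{bnd}(\mathcal{E})$ and $G_{\mathcal{E}}$ is
the parity game associated to $\mathcal{E}$. The game $G_{\mathcal{E}} = (V, E, \Omega, \mathcal{P})$ is defined as:

— $V = \mathrm{bnd}(\mathcal{E}) \times \mathbb{D}$ is the set of nodes;
— $E$ is the edge relation, satisfying $(X_i, v) \to (X_j, w)$ for given $X_i$, $j \in J_i$, $v$
and $w$ if and only if for some $\delta$, both $[\![f_j]\!]\delta[v/d]$ and $w = [\![g_j]\!]\delta[v/d]$ hold;
— $\Omega((X_i, v)) = \mathrm{rank}_{\mathcal{E}}(X_i)$; and
— $\mathcal{P}((X_i, v)) = \Diamond$ iff $\mathrm{op}_{\mathcal{E}}(X_i) = \lor$.

Note that the parity game $G_{\mathcal{E}}$ may have an infinite state space when $\mathbb{D}$ is infinite.
In practice, we are often interested in the part of the parity game that
is reachable from some initial node $(X, v)$; this is often (but not always) finite.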
This is illustrated by the following example.

Example 6. Consider the following PBES in SRF:

$$(\nu X(b{:}B) = (b \land X(\mathit{false})) \lor \exists n{:}N.\, n \leq 2 \land Y(b, \mathit{if}(b, n, 0)))$$
$$(\mu Y(b{:}B, n{:}N) = \mathit{true} \Rightarrow Y(\mathit{false}, 0))$$

The six nodes in the parity game which are reachable from $(X, \mathit{true})$ are depicted
in Figure 3. The horizontally drawn edges all stem from the clause
$\exists n{:}N.\, n \leq 2 \land Y(b, \mathit{if}(b, n, 0))$. Vertical edges stem from the clause $b \land X(\mathit{false})$ (on the left)
or the clause $\mathit{true} \Rightarrow Y(\mathit{false}, 0)$ (on the right). The selfloop also stems from
the clause $\mathit{true} \Rightarrow Y(\mathit{false}, 0)$. Player $\Box$ wins all nodes in this game, and thus
$\mathit{true} \notin [\![\mathcal{E}]\!](X)$.
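The reachable part of the game in this example is small enough to enumerate directly. The sketch below hard-codes the two equations as a successor function, reading the guard in the first equation as n ≤ 2. Nodes are encoded as tuples such as `("X", True)` and `("Y", False, 0)`; this encoding and the function names are assumptions made for illustration.

```python
from collections import deque
from typing import List, Set, Tuple

Node = Tuple  # ("X", b) or ("Y", b, n)


def successors(node: Node) -> List[Node]:
    """Edges induced by the two equations of the example."""
    if node[0] == "X":
        _, b = node
        succs: List[Node] = []
        if b:  # clause b AND X(false)
            succs.append(("X", False))
        for n in range(3):  # clause EXISTS n:N. n <= 2 AND Y(b, if(b, n, 0))
            succs.append(("Y", b, n if b else 0))
        return succs
    return [("Y", False, 0)]  # clause true => Y(false, 0)


def reachable(init: Node) -> Set[Node]:
    """Breadth-first search over the game graph starting from init."""
    seen = {init}
    queue = deque([init])
    while queue:
        node = queue.popleft()
        for succ in successors(node):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen


nodes = reachable(("X", True))
```

The search finds the six nodes mentioned in the example; every play eventually stays in the self-loop on `("Y", False, 0)`, which belongs to the μ-equation.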


As suggested by the above example, each edge is associated to (at least) one
clause in $\mathcal{E}$. Consequently, we can use the index sets $J_i$ to event-label the edges
emanating from nodes associated with the equation for $X_i$. We denote the set
of all events in $\mathcal{E}$ by $\mathrm{evt}(\mathcal{E})$, defined as $\mathrm{evt}(\mathcal{E}) = \bigcup_{X_i \in \mathrm{bnd}(\mathcal{E})} J_i$. Event $j \in J_i$ is
invisible if $\mathrm{rank}_{\mathcal{E}}(X_i) = \mathrm{rank}_{\mathcal{E}}(X_j)$ and $\mathrm{op}_{\mathcal{E}}(X_i) = \mathrm{op}_{\mathcal{E}}(X_j)$, and visible otherwise.

Fig. 3. Reachable part of the parity game underlying the PBES of Example 6
when starting from node $(X, \mathit{true})$. [Figure not reproduced.]


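Definition 9. Let $G_{\mathcal{E}}$ be the parity game associated to $\mathcal{E}$. The labelled parity
game associated to $\mathcal{E}$ is the structure $(G_{\mathcal{E}}, \mathrm{evt}(\mathcal{E}), \ell)$, where $G_{\mathcal{E}}$ is as defined
in Def. 8, and, for $j \in J_i$, $\ell(j)$ is defined as the set $\{((X_i, v), (X_j, w)) \in E \mid
[\![f_j]\!]\delta[v/d] \text{ holds true and } w = [\![g_j]\!]\delta[v/d] \text{ for some } \delta\}$.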


5 PBES Solving Using POR 


A consequence of the partial-order reduction theorem is that a reduced parity
game suffices for computing the truth value of $X(e)$ for a given PBES $\mathcal{E}$ with $X \in
\mathrm{bnd}(\mathcal{E})$. However, D1, D2w/D2t and L are conditions on the (reduced) state
space as a whole and, hence, hard to check locally. We therefore approximate
these conditions in such a way that we can construct a stubborn set on-the-fly.

From hereon, let $\mathcal{E}$ be a PBES in SRF and $(G, S, \ell)$, with $G = (V, E, \Omega, \mathcal{P})$,
its labelled parity game. The most common local condition for L is the stack
proviso LS [26]. This proviso assumes that the state space is explored with
depth-first search (DFS), and it uses the Stack that stores unexplored nodes to
determine whether a cycle is being closed. If so, the node will be fully expanded,
i.e., $r(s) = S$.

LS For all nodes $s \in V_r$, either $\mathrm{succ}_{G_r}(s) \cap \mathit{Stack} = \emptyset$ or $r(s) = S$.
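A stack proviso of this kind can be sketched as a recursive depth-first search that first tries the reduced event set and falls back to all events when a reduced successor is still on the search stack. The function `explore`, its parameters, and the two-event toy system below are hypothetical; the sketch illustrates the proviso only and does not check the other conditions on the reduction function.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple

Node = Hashable
Edge = Tuple[str, Node]


def explore(
    init: Node,
    all_events: Sequence[str],
    step: Callable[[Node, str], List[Node]],
    reduce: Callable[[Node], Sequence[str]],
) -> Dict[Node, List[Edge]]:
    """Depth-first exploration of the reduced game under the stack proviso."""
    edges: Dict[Node, List[Edge]] = {}
    on_stack: Set[Node] = set()

    def expand(node: Node, events: Sequence[str]) -> List[Edge]:
        return [(j, t) for j in events for t in step(node, j)]

    def visit(node: Node) -> None:
        on_stack.add(node)
        succs = expand(node, reduce(node))
        # Stack proviso: a reduced successor on the stack closes a cycle,
        # so fall back to full expansion, i.e. r(s) = S.
        if any(t in on_stack for _, t in succs):
            succs = expand(node, all_events)
        edges[node] = succs
        for _, t in succs:
            if t not in edges:
                visit(t)
        on_stack.discard(node)

    visit(init)
    return edges


# Hypothetical toy system: event "a" toggles x, event "c" sets y from 0 to 1.
def step(node, event):
    x, y = node
    if event == "a":
        return [(1 - x, y)]
    if event == "c" and y == 0:
        return [(x, 1)]
    return []


game = explore((0, 0), ["a", "c"], step, lambda node: ["a"])
```

In the toy system the reduction always proposes only event `a`. Without the proviso, event `c` would be ignored forever along the `a`-cycle; with it, node `(1, 0)` closes the cycle, is fully expanded, and the nodes with `y = 1` are found. For large state spaces an explicit stack would be preferable to recursion.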

Locally approximating conditions D1 and D2w requires a static analysis of 
the PBES. For this, we draw upon ideas from [17] and extend these to properly 
deal with non-determinism. To reason about which events are independent, we 
rely on the idea of accordance.

Definition 10. Let $j, j' \in S$. We define the accordance relations DNL, DNS,
DNT and DNA on $S$ as follows:
— $j$ left-accords with $j'$ if for all nodes $s, s' \in V$, if $s \xrightarrow{j'j} s'$, then also $s \xrightarrow{jj'} s'$.
If $j$ does not left-accord with $j'$, we write $(j, j') \in \mathrm{DNL}$.
— $j$ square-accords with $j'$ if for all nodes $s, s_1, s_2 \in V$, if $s \xrightarrow{j} s_1$ and $s \xrightarrow{j'} s_2$,
then for some $s' \in V$, $s_1 \xrightarrow{j'} s'$ and $s_2 \xrightarrow{j} s'$. If $j$ does not square-accord with
$j'$ we write $(j, j') \in \mathrm{DNS}$.
— $j$ triangle-accords with $j'$ if for all nodes $s, s_1, s_2 \in V$, if $s \xrightarrow{j} s_1$ and $s \xrightarrow{j'} s_2$,
then $s_2 \xrightarrow{j} s_1$. If $j$ does not triangle-accord with $j'$ we write $(j, j') \in \mathrm{DNT}$.
— $j$ accords with $j'$ if $j$ square-accords or triangle-accords with $j'$. If $j$ does not
accord with $j'$ we write $(j, j') \in \mathrm{DNA}$.
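On an explicit, finite labelled graph the accordance conditions can be checked by brute force. The sketch below assumes the following reading of the definition: left-accordance lets j move in front of j', square-accordance closes a diamond, and triangle-accordance requires a j-step from the j'-successor to the j-successor. The transition encoding (a dictionary from events to sets of node pairs) and all function names are illustrative. The paper itself approximates these relations by static analysis rather than by enumeration.

```python
from typing import Dict, Iterable, Set, Tuple

Trans = Dict[str, Set[Tuple[int, int]]]


def post(trans: Trans, j: str, s: int) -> Set[int]:
    """Successors of node s under event j."""
    return {t for (u, t) in trans.get(j, set()) if u == s}


def left_accords(trans: Trans, nodes: Iterable[int], j: str, j2: str) -> bool:
    """Whenever s -j2-> m -j-> t, there must also be s -j-> m2 -j2-> t."""
    for s in nodes:
        for m in post(trans, j2, s):
            for t in post(trans, j, m):
                if not any(t in post(trans, j2, m2) for m2 in post(trans, j, s)):
                    return False
    return True


def square_accords(trans: Trans, nodes: Iterable[int], j: str, j2: str) -> bool:
    """Whenever s -j-> s1 and s -j2-> s2, some node closes the diamond."""
    for s in nodes:
        for s1 in post(trans, j, s):
            for s2 in post(trans, j2, s):
                if not (post(trans, j2, s1) & post(trans, j, s2)):
                    return False
    return True


def triangle_accords(trans: Trans, nodes: Iterable[int], j: str, j2: str) -> bool:
    """Whenever s -j-> s1 and s -j2-> s2, there must be s2 -j-> s1."""
    for s in nodes:
        for s1 in post(trans, j, s):
            for s2 in post(trans, j2, s):
                if s1 not in post(trans, j, s2):
                    return False
    return True


# Hypothetical diamond: a and b commute; c is only possible in node 0.
NODES = [0, 1, 2, 3]
TRANS: Trans = {"a": {(0, 1), (2, 3)}, "b": {(0, 2), (1, 3)}, "c": {(0, 1)}}
```

In the diamond, `a` square-accords and left-accords with `b` but does not triangle-accord with it, while `c` does not square-accord with `b` because `c` is lost after taking `b`. Pairs that fail a check would be placed in DNS, DNT or DNL respectively.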


Note that DNL and DNT are not necessarily symmetric. An illustration of the
left-according, square-according and triangle-according conditions is given below:

[Figure: diagrams of the left-according, square-according and triangle-according conditions; not reproduced.]


Accordance relations safely approximate the independence of events. The dependence
of events, required for satisfying D2w, can be approximated using Godefroid's
necessary enabling sets [8].


Definition 11. Let $j$ be an event that is disabled in some node $s$. A necessary-
enabling set (NES) for $j$ in $s$ is any set $\mathrm{NES}_s(j) \subseteq S$ such that for every
execution $s \xrightarrow{j_1 \dots j_n j}$, there is at least one $j_i$ such that $j_i \in \mathrm{NES}_s(j)$.


For every node and event there might be more than one NES. In particular, every 
superset of a NES is also a NES. A larger-than-needed NES may, however, have a 
negative impact on the reduction that can be achieved. In a PBES with multiple 
parameters per predicate variable, computing a NES can be done by determining 
which parameters influence the validity of guards $f_j$ and which parameters are
changed in the update functions $g_j$. A more accurate NES may be computed
using techniques to extract a control flow from a PBES [15]. 
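The parameter-based computation of a NES can be sketched as a read/write analysis. The sketch assumes that for every event we already know which parameters its guard reads and which parameters its update may change; the dictionaries `GUARD_READS` and `WRITES`, the event names, and the treatment of the predicate variable as just another parameter (`pc`) are all hypothetical.

```python
from typing import Dict, Set


def necessary_enabling_set(
    j: str, guard_reads: Dict[str, Set[str]], writes: Dict[str, Set[str]]
) -> Set[str]:
    """Events that may change a parameter on which the guard of j depends.

    If j is disabled, its guard can only become true after one of these
    events has occurred, so the result over-approximates a NES for j.
    """
    needed = guard_reads[j]
    return {k for k, changed in writes.items() if changed & needed}


# Hypothetical events over two parameters: a control location pc and a counter n.
GUARD_READS = {"go": {"pc"}, "inc": {"pc"}, "done": {"n"}}
WRITES = {"go": {"pc"}, "inc": {"n"}, "done": set()}

nes_done = necessary_enabling_set("done", GUARD_READS, WRITES)
nes_inc = necessary_enabling_set("inc", GUARD_READS, WRITES)
```

The result does not depend on the current node, so it is coarser than a NES computed per node. That is sound, since every superset of a NES is again a NES, but it may reduce less.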

The following lemmata show how the accordance relations and necessary-
enabling sets can be used to implement conditions D1, D2w and D2t, respectively.
A combination of Lemma 5 and 6 in a deterministic setting appeared as
Lemma 1 in [17]. Note that as a notational convention we write $R(j)$ to denote
the projection $\{j' \mid (j, j') \in R\}$ of a binary relation $R$.


Lemma 5. A reduction function $r$ satisfies D1 in node $s \in V$ if for all $j \in r(s)$:
— if $j$ is disabled in $s$, then $\mathrm{NES}_s(j) \subseteq r(s)$ for some $\mathrm{NES}_s$; and
— if $j$ is enabled in $s$, then $\mathrm{DNL}(j) \subseteq r(s)$.


Lemma 6. A reduction function $r$ satisfies D2w in a node $s \in V$ if there is an
enabled event $j \in r(s)$ such that $\mathrm{DNS}(j) \subseteq r(s)$.
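Taken together, the premises of Lemma 5 and Lemma 6 suggest a closure computation: pick an enabled key event, add everything that does not square-accord with it, and then keep adding NES or DNL sets until nothing changes. The sketch below assumes that the relations are given as dictionaries from an event to the projection R(j) and that one NES per disabled event has already been chosen. The function `stubborn_set` and the example data are illustrative and are not the pbespor implementation.

```python
from typing import Dict, Set


def stubborn_set(
    key: str,
    enabled: Set[str],
    dnl: Dict[str, Set[str]],
    dns: Dict[str, Set[str]],
    nes: Dict[str, Set[str]],
) -> Set[str]:
    """Smallest set containing `key` that meets the premises of Lemmas 5 and 6."""
    if key not in enabled:
        raise ValueError("the key event must be enabled")
    result: Set[str] = set()
    # Lemma 6 premise: the key event and everything in DNS(key).
    work = [key, *dns.get(key, set())]
    while work:
        j = work.pop()
        if j in result:
            continue
        result.add(j)
        # Lemma 5 premise: DNL(j) for enabled events, a NES for disabled ones.
        required = dnl.get(j, set()) if j in enabled else nes.get(j, set())
        work.extend(required)
    return result


# Hypothetical relations over four events.
ENABLED = {"a", "b"}
DNL = {"a": set(), "b": {"a"}}
DNS = {"a": {"c"}, "b": set()}
NES = {"c": {"d"}, "d": set()}

from_a = stubborn_set("a", ENABLED, DNL, DNS, NES)
from_b = stubborn_set("b", ENABLED, DNL, DNS, NES)
```

Starting from `a` the closure pulls in the disabled events `c` and `d` but leaves out the enabled event `b`, so only `a` is explored in that node. Trying several key events and keeping the set with the fewest enabled events is one possible heuristic.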


Lemma 7. A reduction function $r$ satisfies D2t in a node $s$ if there is an enabled
event $j \in r(s)$ such that $\mathrm{DNA}(j) \subseteq r(s)$.

More reduction can be achieved if a PBES is partly or completely 'deterministic',
in which case some of the conditions can be relaxed. We say that an event $j$ is
deterministic, denoted by $\mathrm{det}(j)$, if for all nodes $t, t', t'' \in V$, if $t \xrightarrow{j} t'$ and $t \xrightarrow{j} t''$,
then also $t' = t''$. This means event-determinism can be characterised as follows:


$$\mathrm{det}(j) \text{ iff } [\![f_j]\!]\delta \text{ and } [\![f_j]\!]\delta' \text{ implies } [\![g_j]\!]\delta = [\![g_j]\!]\delta' \text{ for all } \delta, \delta' \text{ with } \delta(d) = \delta'(d).$$


The following lemma specialises Lemma 5 and shows how knowledge of de-
terministic events can be applied to potentially improve the reduction.


Lemma 8. A reduction function $r$ satisfies D1 in a node $s$ if for all $j \in r(s)$:
— if $j$ is disabled in $s$, then $\mathrm{NES}_s(j) \subseteq r(s)$ for some $\mathrm{NES}_s$;
— if $\mathrm{det}(j)$ and $j$ is enabled in $s$, then $\mathrm{DNS}(j) \subseteq r(s)$ or $\mathrm{DNL}(j) \subseteq r(s)$; and
— if $\neg\mathrm{det}(j)$ and $j$ is enabled in $s$, then $\mathrm{DNL}(j) \subseteq r(s)$.


Since relations DNS and DNL are incomparable we cannot decide a priori which
should be used for deterministic events. However, Lemma 8 permits choosing one
of the accordance sets on-the-fly. This choice can be made based on a heuristic
function, similar to the function for NESs proposed in [17].


6 Experiments 


We implemented the ideas from the previous section in a prototype tool, called
pbespor, as part of the mCRL2 toolset [5]; it is written in C++. Our tool
converts a given input PBES to a PBES in SRF, runs a static analysis to compute
the accordance relations (see Section 5), and uses a depth-first exploration to
compute the parity game underlying the PBES in SRF. The static analysis relies
on an external SMT solver (we use Z3 in our experiments). To limit the amount
of static analysis required and to improve the reduction, the implementation
contains a rudimentary way of identifying whether the same event occurs in
multiple PBES equations. Experiments are conducted on a machine with an Intel
Xeon 6136 CPU @ 3 GHz, running mCRL2 with Git commit hash dd36f98875.
To measure the effectiveness of our implementation, we analysed the following
mCRL2 models*: Anderson's mutual exclusion protocol [1], the dining philoso-
phers problem, the gas station problem [11], Hesselink's handshake register [12],
Le Lann's leader election protocol [18], Milner's Scheduler and the Krebs
cycle of ATP production in biological cells (model inspired by [25]). Most of
these models are scalable. Each model is subjected to one or more requirements
phrased as mCRL2's first-order modal μ-calculus formulae. Where possible, Table 1
provides a CTL* formula that captures the essence of the requirement.
We analyse the effectiveness of our partial-order reduction technique by mea-
suring the reduction of the size of the state space, and the time that is required to
generate the state space. Since the static analysis that is conducted can require
a non-negligible amount of time, we pay close attention to the various forms of
static analysis that can be conducted. In particular, we compare the total time
and effectiveness (in terms of reduction) of running the following static analyses:


* The models are archived online at https://doi.org/10.5281/zenodo.3602969

Table 1. Runtime (analysis + exploration; in seconds) and number of states when
exploring either the full state space or the reduced state space, for four different static
analysis approaches. Figures printed in boldface indicate which of the additional static
analyses is able to achieve the largest reduction over 'basic' (if any).

                                           full             basic              +DNL              +NES              +D2t
model           property              nodes    time     nodes    time     nodes    time     nodes    time     nodes    time
gas station.c3  JO accept              1197    0.14      1077    0.98      1077    2.48      1077    1.87       735     .62
gas station.c3  3030 pumping            126    0.15       967    0.98       967    2.61       967    1.99       967     .12
gas station.c3  no deadlock            1197    0.18       735    0.95       735    2.52       735    2.04       735      52
scheduler8      no deadlock            3073    0.29        34    0.19        34    0.70        34    0.51        34    0.35
scheduler10     no deadlock           15536     .65        42    0.25        42    0.90        42    0.65        42    0.42
anderson.       VOcs                  23597    4.59      2957    2.85      2957    6.47      2957    3.89      2957    4.61
hesselink       cache consistency     91009    5.28     82602    8.19     83602   12.12     81988    9.00     71911    8.51
dining10        no deadlock          154451   17.90      4743    0.76      4743     1.6      4743     .42      4743      02
krebs.3         VO energy            238877   24.38    232273   24.59    232273   25.62    209345   21.73    232273   24.42
gas station.c6  JO accept            186?38   38.00    150741   40.55    150741   45.50    150741   43.16     75411   21.40
gas station.c6  3030 pumping         192700   38.63    114130   27.35    114130   31.42    114130   30.49    114130   29.74
gas station.c6  no deadlock           18638   42.50     75411   21.00     75411   24.88     75411   24.01     75411   23.02
scheduler14     no deadlock          344065   53.14        58    0.37        58      13        58    0.97        58    0.61
hesselink       (wr dO fin)         1047233   61.02   1013441   82.44   1013441   86.49   1013441   84.59    791273   61.56
                (wr = VOfin)        1047232   70.14    791320   64.05    791374   66.53    749936   62.98    791268   67.59
                VO energy           1047406  124.30    971128  117.38    971128  117.41    843349  101.51    971128  117.41
lann.5          consistent data      818104  142.38    818104  170.18    818104  175.87    818104  177.78    761239  155.22
anderson.5      no deadlock          689901  142.63    257944   73.62    257672   79.91    257711   78.67    257918   76.47
lann.5          no data loss        1286452  199.74    453130   73.28    453130   77.31    453130   74.40    453130   75.52
dining10        ∀□∀◇ eat            1698951  225.10    101185   12.37    101056   13.55    101238   13.01    101022   12.69
anderson.       VOcs                3964599 1331.91    124707   63.83    124707   73.87    124707   68.67    124707   69.68


— computing left-accordance (DNL) vs. over-approximating it with all events;
— computing a NES vs. over-approximating it with the set of all events S; and
— using D2w vs. the use of D2t (i.e., use Lemma 6 vs. Lemma 7).


As a baseline for comparisons, we take a basic static analysis (over-approximated
DNL, over-approximated NES, D2w), see column 'basic' in Table 1. In order to
guarantee termination of the static analysis phase, we set a timeout of 200ms per
formula that is sent to the solver. Table 1 reports on the statistics we obtained for
exploring the full state space and the four possible POR configurations described
above; the table is sorted with respect to the time needed for a full exploration.
The time we list consists of the time needed to conduct the analysis plus the
time needed for the exploration.


For most small instances, the time required for static analysis dominates any 
speed-up gained by the state space reduction. When the state spaces are larger, 
achieving a speed-up becomes more likely, while the highest overhead suffered 
by ‘basic’ is 55% (Hesselink, cache consistency). Significant reduction can be 
achieved even for non-trivial properties, such as ‘lann.5’ with ‘no data loss’. 
Scheduler is an extreme case: its processes have very few dependencies, leading 
to an exponential reduction, both in terms of the state space size and in terms 
of time. In several cases, the use of a NES or D2t brings extra reduction (high- 
lighted in bold). Moreover, the extra time required to conduct the additional 
analysis seems limited. The use of DNL, on the other hand, never pays off in our 
experiments; it even results in a slightly larger state space in two cases.

We note that there are also models, not listed in Table 1, where our static
analysis does not yield any useful results and no reduction is achieved. Even if 
in such cases a reduction would be possible in theory, the current static analysis 
engines are unable to deal with the more complex data types often used in such 
models; e.g., recursively defined lists or infinite sets, represented symbolically 
with higher-order constructions. This calls for further investigations into static 
analysis theories that can effectively deal with complex data. 

Finally, we point out that in the case of, e.g., the dining philosophers problem, 
the relative reduction under the ‘no deadlock’ property is much better than 
under the '∀□∀◇ eat' property. This demonstrates the impact properties can
have on the reductions achievable, and it also points at a phenomenon we have 
not stressed in the current work, viz., the impact of identifying events on the 
reductions achievable. We explain the phenomenon in the following example. 


Example 7. Consider the LTS and the parity game on the right [figure not
reproduced]. The parity game encodes the property
$\nu X.([-]X \land \forall i.\, \mu Y.([\overline{a_i}]Y \land \langle - \rangle \mathit{true}))$, which is equivalent
to $\forall\Box\Diamond a_i$, on this LTS. The event $xy$ represents
the transition from fixpoint $X$ into $Y$, which does not
involve an action from the LTS. Note that the complete
state space is encoded in the fixpoint $X$. Due to
the absence of some transitions in the part of the state
space encoded in fixpoint $Y$, neither $a_1$ nor $a_2$ is according
with $xy$. Hence, the only stubborn set in the initial
node is $\{a_1, a_2, xy\}$, which yields no reduction.


Improving the event identification procedure can yield more reduction. For
instance, if, for each $i$ (bound in the universal quantifier), a different event $xy_i$
is created, then both $a_1, xy_2$ and $a_2, xy_1$ will be according. If we disregard the
visibility of $xy_1$ and $xy_2$, four nodes can be eliminated.


7 Conclusion 


We have presented an approach for applying partial-order reduction on parity 
games. This has two main advantages over POR applied on LTSs or Kripke 
structures: our approach supports the full modal μ-calculus, not just a fragment
thereof, and the potential for reduction is greater, because we do not require 
a singleton proviso. Furthermore, we have shown how the ideas can be imple- 
mented with PBESs as a high-level representation. In future work, we aim to 
gain more insight into the effect of identifying events across PBES equations in 
several ways. We also want to investigate the possibility of solving a reduced 
parity game while it is being constructed. In certain cases, one may be able to
decide the winner of the original game from this partial solution.


Abstract. We study identification in the limit using polynomial time
and data for models of ω-automata. On the negative side we show that
non-deterministic ω-automata (of types Büchi, coBüchi, Parity or Muller)
cannot be polynomially learned in the limit. On the positive side we
show that the ω-language classes IB, IC, IP, and IM that are defined
by deterministic Büchi, coBüchi, parity, and Muller acceptors that are
isomorphic to their right-congruence automata (that is, the right congru-
ences of languages in these classes are fully informative) are identifiable
in the limit using polynomial time and data. We further show that for
these classes a characteristic sample can be constructed in polynomial
time.


Keywords: identification in the limit, characteristic sample, ω-regular.


1 Introduction 


With the growing success of machine learning in efficiently solving a wide spec-
trum of problems, we are witnessing an increased use of machine learning tech-
niques in formal methods for system design. One thread in recent literature
uses general purpose machine learning techniques for obtaining more efficient
verification/synthesis algorithms. Another thread, following the automata-theoretic
approach to verification [33,21], works on developing grammatical inference
algorithms for verification and synthesis purposes. Grammatical inference (aka
automata learning) refers to the problem of automatically inferring from exam-
ples a finite representation (e.g. an automaton, a grammar, or a formula) for
an unknown language. The term model learning [31] was coined for the task of
learning an automaton model for an unknown system. A large body of works
has developed learning techniques for different automata types (e.g. I/O automata [1],
register automata [20], symbolic automata [14], ω-automata [7], and
program automata [25]) and has shown its usability in a diverse range of tasks.*

* … finding security bugs [10], error localization [11], and code refactoring [26,29].

In grammatical inference, the learning algorithm does not learn a language,
but rather a finite representation of it. Complexity of learning algorithms may
vary greatly by switching representations. For instance, if one wishes to learn
regular languages, she may consider representations using deterministic finite au- 
tomata (DFAs), non-deterministic finite automata (NFAs), regular expressions, 
linear grammars etc. Since the translation results between two such formalisms 
are not necessarily polynomial, a polynomial learnability result for one repre- 
sentation does not necessarily imply a polynomial learnability result for another 
representation. Let $\mathbb{C}$ be a class of representations $C$ with a size measure size($C$)
(e.g. for DFAs the size measure can be the number of states in the minimal
automaton). We extend size(·) to the languages recognized by representations in
$\mathbb{C}$ by defining size($L$) to be the minimum of size($C$) over all $C$ representing $L$. In
this paper we restrict attention to automata representations, namely, acceptors.


There are various learning paradigms considered in the grammatical inference
literature, roughly classified into passive and active. We mention here the two
central ones. In passive learning the model of learning from finite data refers to
the following problem: given a finite sample $T \subseteq \Sigma^* \times \{0,1\}$ of labeled words, a
learning algorithm A should return an acceptor $C$ that agrees with the sample
$T$. That is, for every $(w, l) \in T$ the following holds: $w \in [\![C]\!]$ iff $l = 1$ (where
$[\![C]\!]$ is the language accepted by $C$). The class $\mathbb{C}$ is identifiable in the limit using
polynomial time and data if and only if there exists a polynomial time algorithm
A that takes as input a labeled sample $T$ and outputs an acceptor $C \in \mathbb{C}$ that
is consistent with $T$, and A also satisfies the following condition. If $L$ is any
language recognized by an automaton from class $\mathbb{C}$, then there exists a labeled
sample $T_L$ consistent with $L$ of length bounded by a polynomial in size($L$), and
for any labeled sample $T$ consistent with $L$ such that $T_L \subseteq T$, on input $T$ the
algorithm A produces an acceptor $C$ that recognizes $L$. In this case, $T_L$ is termed
a characteristic sample for the algorithm A. In some cases (e.g., DFAs) there
is also a polynomial time algorithm to compute a characteristic sample for A,
given an acceptor $C \in \mathbb{C}$.
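The consistency requirement between an acceptor and a labeled sample is easy to state in code for the finite-word case. The sketch below assumes a DFA is encoded as a dictionary with keys `initial`, `accepting` and `delta`; this encoding, the example automaton and the sample are illustrative only.

```python
from typing import Dict, Iterable, Tuple


def accepts(dfa: Dict, word: str) -> bool:
    """Run a (possibly partial) DFA on a finite word."""
    state = dfa["initial"]
    for symbol in word:
        state = dfa["delta"].get((state, symbol))
        if state is None:  # missing transition: the word is rejected
            return False
    return state in dfa["accepting"]


def consistent(dfa: Dict, sample: Iterable[Tuple[str, int]]) -> bool:
    """A sample agrees with the DFA iff w is accepted exactly when its label is 1."""
    return all(accepts(dfa, word) == (label == 1) for word, label in sample)


# Hypothetical DFA over {a, b} accepting words with an even number of a's.
EVEN_A = {
    "initial": 0,
    "accepting": {0},
    "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
}

SAMPLE = [("", 1), ("a", 0), ("aba", 1), ("b", 1)]
```

A learner in this setting must return some acceptor for which `consistent` holds on the given sample. The characteristic sample is what forces that acceptor to recognize the target language.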


In active learning the model of query learning [5] assumes the learner commu-
nicates with an oracle (sometimes called teacher) that can answer certain types
of queries about the language. The most common type of queries are member-
ship queries (is $w \in L$ where $L$ is the unknown language) and equivalence queries
(is $[\![A]\!] = L$ where $A$ is the current hypothesis for an acceptor recognizing $L$).
Equivalence queries are typically assumed to return a counterexample, i.e. a
word in $[\![A]\!] \setminus L$ or in $L \setminus [\![A]\!]$.

With regard to ω-automata (automata on infinite words) most of the works
consider query learning. The representations learned so far include: $(L)_\$$ [15], a
non-polynomial reduction to finite words; families of DFAs (FDFA) [7,8,6,22];
strongly unambiguous Büchi automata (SUBA) [3]; and deterministic weak par-
ity automata (DWPA) [23]. Among these only the last is learnable in polyno-
mial time using membership queries and proper equivalence queries.


One of the main obstacles in obtaining a polynomial learning algorithm for
ω-regular languages is that they do not in general have a Myhill-Nerode characterization;
that is, there is no theorem correlating the states of a minimal
automaton of some of the common automata types (Büchi, Parity, Muller, etc.)
to the equivalence classes of the right congruence of the language. The right congruence
relation for an ω-language $L$ relates two finite words $x$ and $y$ iff there
is no infinite suffix $z$ differentiating them, that is $x \sim_L y$ (for $x, y \in \Sigma^*$) iff
$\forall z \in \Sigma^\omega.\ xz \in L \iff yz \in L$. In our quest for finding a polynomial query
learning algorithm for a subclass of the ω-regular languages, we have studied
subclasses of languages for which such a relation holds [4], and termed them
fully informative. We use IB, IC, IP, IM to denote the classes of languages that
are fully informative of type Büchi, coBüchi, Parity and Muller, respectively. A
language $L$ is said to be fully informative of type $X$ for $X \in \{B, C, P, M\}$ if there
exists a deterministic automaton of type $X$ which is isomorphic to the automaton
derived from $\sim_L$. While a lot of properties about these classes are now known,
in particular that they span the entire hierarchy of ω-regular properties [34], a
polynomial learning algorithm for them has not been found yet.

In this paper we show that the classes IB, IC, IP, IM can be identified in the 
limit using polynomial time and data. We further show that there is a polynomial 
time algorithm to compute a characteristic sample given an acceptor $C \in \mathrm{I}X$. A
corollary of this result is that the class of languages accepted by DWPAs (which
as mentioned above is polynomially learnable in the query learning setting) also 
has a polynomial size characteristic sample. On the negative side, we show that 
the classes NBA, NCA, NPA, NMA of non-deterministic Büchi, coBüchi, Parity 
and Muller automata, resp., cannot be identified in the limit using polynomial 
data. 


2 Preliminaries 


Automata and Acceptors An automaton is a tuple $\mathcal{A} = (\Sigma, Q, q_\iota, \delta)$ consisting of
a finite totally ordered alphabet $\Sigma$ of symbols, a finite set $Q$ of states, an initial
state $q_\iota \in Q$, and a transition function $\delta : Q \times \Sigma \to 2^Q$. A run of an automaton
on a finite word $v = a_1 a_2 \dots a_n$ is a sequence of states $q_0, q_1, \dots, q_n$ such that
$q_0 = q_\iota$, and for each $i \geq 0$, $q_{i+1} \in \delta(q_i, a_{i+1})$. A run on an infinite word is
defined similarly and results in an infinite sequence of states. We say that $\mathcal{A}$ is
deterministic if $|\delta(q, a)| \leq 1$ and complete if $|\delta(q, a)| \geq 1$, for every $q \in Q$ and
$a \in \Sigma$. We extend $\delta$ to domain $Q \times \Sigma^*$ in the usual manner, and abbreviate
$\delta(q, \sigma) = \{q'\}$ as $\delta(q, \sigma) = q'$.

By augmenting an automaton with an acceptance condition $\alpha$, obtaining a
tuple $(\Sigma, Q, q_\iota, \delta, \alpha)$, we get an acceptor, a machine that accepts some words and
rejects others. An acceptor accepts a word if at least one of the runs on that word
is accepting. For finite words the acceptance condition is a set $F \subseteq Q$ and a run
on a word $v$ is accepting if it ends in an accepting state, i.e., if $\delta(q_\iota, v)$ contains
an element of $F$. For infinite words, there are various acceptance conditions in
the literature; we consider four: Büchi, coBüchi, parity, and Muller. The Büchi
and coBüchi acceptance conditions are also a set $F \subseteq Q$. A run of a Büchi
automaton is accepting if it visits $F$ infinitely often. A run of a coBüchi automaton is
accepting if it visits $F$ only finitely many times. A parity acceptance condition
is a map $\kappa : Q \to \mathbb{N}$ assigning each state a natural number termed a color (or
priority). A run is accepting if the minimum color visited infinitely often is odd.
A Muller acceptance condition is a set of sets of states $\alpha = \{F_1, F_2, \dots, F_k\}$ for
some $k \in \mathbb{N}$ and $F_i \subseteq Q$ for $i \in [1..k]$. A run of a Muller automaton is accepting if
the set $S$ of states visited infinitely often in the run is a member of $\alpha$. We use $[\![\mathcal{A}]\!]$
to denote the set of words accepted by a given acceptor $\mathcal{A}$. We use NBA, NPA,
NMA, NCA for non-deterministic Büchi, parity, Muller and coBüchi automata.
We use $\mathbb{NBA}$, $\mathbb{NPA}$, $\mathbb{NMA}$ and $\mathbb{NCA}$ for the classes of languages they recognize.
The first three recognize the full class of ω-regular languages while the fourth only
a subset of it.
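For a deterministic, complete automaton and an ultimately periodic word, all four acceptance conditions depend only on the set of states that the unique run visits infinitely often, and that set can be computed from the lasso-shaped run. The sketch below assumes the transition function is a dictionary keyed by (state, symbol) and that the periodic part is non-empty; the function names and the two-state example are illustrative.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]


def inf_states(delta: Delta, initial: State, u: str, v: str) -> Set[State]:
    """States visited infinitely often by the unique run on u followed by v repeated forever."""
    state = initial
    for a in u:
        state = delta[(state, a)]
    # Record the state at each period boundary until one repeats.
    seen: Dict[State, int] = {}
    boundary = []
    while state not in seen:
        seen[state] = len(boundary)
        boundary.append(state)
        for a in v:
            state = delta[(state, a)]
    # Boundary states from the first repeated one onwards lie on the loop.
    recurring: Set[State] = set()
    for start in boundary[seen[state]:]:
        q = start
        for a in v:
            recurring.add(q)
            q = delta[(q, a)]
    return recurring


def buchi_accepts(inf: Set[State], final: Set[State]) -> bool:
    return bool(inf & final)  # F is visited infinitely often


def cobuchi_accepts(inf: Set[State], final: Set[State]) -> bool:
    return not (inf & final)  # F is visited only finitely many times


def parity_accepts(inf: Set[State], color: Dict[State, int]) -> bool:
    return min(color[q] for q in inf) % 2 == 1  # minimum recurring color is odd


def muller_accepts(inf: Set[State], family: Set[FrozenSet[State]]) -> bool:
    return frozenset(inf) in family


# Hypothetical automaton over {a, b}: state 1 is entered exactly on reading a.
DELTA: Delta = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0}
```

With final set {1}, the Büchi reading accepts the words with infinitely many a's: the word with empty prefix and period `ab` is accepted, while `a` followed by `b` forever is not.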


Right congruences An equivalence relation $\sim$ on $\Sigma^*$ is a right congruence if
$x \sim y$ implies $xv \sim yv$ for every $x, y, v \in \Sigma^*$. The index of $\sim$, denoted $|{\sim}|$, is the
number of equivalence classes of $\sim$. Given a language $L \subseteq \Sigma^*$ its canonical right
congruence $\sim_L$ is defined as follows: $x \sim_L y$ iff $\forall z \in \Sigma^*$ we have $xz \in L \iff
yz \in L$. For a word $v \in \Sigma^*$ the notation $[v]$ is used for the equivalence class of
$\sim$ in which $v$ resides.

With a right congruence $\sim$ of finite index one can naturally associate an
automaton $M_\sim = (\Sigma, Q, q_\iota, \delta)$ as follows: the set of states $Q$ consists of the
equivalence classes of $\sim$. The initial state $q_\iota$ is the equivalence class $[\epsilon]$. The
transition function $\delta$ is defined by $\delta([u], a) = [ua]$. Similarly, given a complete
deterministic automaton $M = (\Sigma, Q, q_\iota, \delta)$ we can naturally associate with it a
right congruence as follows: $x \sim_M y$ iff $M$ reaches the same state when reading
$x$ or $y$. The Myhill-Nerode Theorem states that a language $L$ is regular iff $\sim_L$
is of finite index. Moreover, if $L$ is accepted by a DFA $A$ then $\sim_A$ refines $\sim_L$.
Finally, the index of $\sim_L$ gives the size of the minimal complete DFA for $L$.

For an ω-language $L \subseteq \Sigma^\omega$, the right congruence $\sim_L$ is defined similarly, by
quantifying over ω-words. That is, $x \sim_L y$ iff $\forall z \in \Sigma^\omega$ we have $xz \in L \iff
yz \in L$. Given a deterministic automaton $M$ we can define $\sim_M$ exactly as for
finite words. However, for ω-regular languages, the relation $\sim_L$ does not suffice to
obtain a "Myhill-Nerode" characterization. As an example consider the language
$L = (a + b)^*(bba)^\omega$. We have that $\sim_L$ consists of just one equivalence class, since
for any $x \in \Sigma^*$ and $w \in \Sigma^\omega$ we have that $xw \in L$ iff $w$ has $(bba)^\omega$ as a suffix.
But an ω-acceptor recognizing $L$ obviously needs more than a single state.


The classes IB, IC, IP and IM A language $L$ is in IB (resp., IC, IP, IM) if
there exists a deterministic Büchi (resp., coBüchi, parity, Muller) acceptor $\mathcal{A}$
such that $L = [\![\mathcal{A}]\!]$ and there is a 1-to-1 relationship between the states of $\mathcal{A}$
and the equivalence classes of $\sim_L$: if $x \sim_L y$ then $x$ and $y$ reach the same state
$q$ in $\mathcal{A}$, and an ω-word $z$ is accepted from $q$ iff $xz \in L$ (which holds iff $yz \in L$).
These classes are more expressive than one might conjecture: it was shown in [4]
that in every class of the infinite Wagner hierarchy [34] there are languages in
IM and IP. Moreover, in a small experiment reported in [4], among randomly
generated Muller automata, the vast majority turned out to be in IM.


Examples and samples Since we need finite representations of examples, ω-words
in our case, we work with ultimately periodic words, that is, words of the form
u(v)^ω where u ∈ Σ* and v ∈ Σ^+. It is known that two regular ω-languages are
equivalent iff they agree on the set of ultimately periodic words, so this choice
is not limiting. The example u(v)^ω is concretely represented by the pair (u, v)
of finite strings, and its length is |u| + |v|. A labeled example is a pair (u(v)^ω, l),
where the label l is either 0 or 1. A sample is a finite set of labeled examples
such that no example is assigned two different labels. The length of a sample
is the sum of the lengths of the examples that appear in it. A sample T and a
language L are consistent with each other if and only if for every labeled example
(u(v)^ω, l) ∈ T, l = 1 iff u(v)^ω ∈ L. A sample and an acceptor are consistent with
each other if and only if the sample and the language recognized by the acceptor
are consistent with each other. The following results give two useful procedures
on examples that are computable in polynomial time.


Claim 1. Given u_1, u_2 ∈ Σ* and v_1, v_2 ∈ Σ^+, if u_1(v_1)^ω ≠ u_2(v_2)^ω then they
differ in at least one of the first ℓ symbols, where ℓ = max(|u_1|, |u_2|) + |v_1| · |v_2|.
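Claim 1 gives a direct polynomial-time equality test on the pair representation. The following is a minimal sketch of that test; the function names `symbol_at` and `same_omega_word` are illustrative and do not come from the paper.

```python
def symbol_at(u: str, v: str, i: int) -> str:
    """Return symbol i (0-based) of the omega-word u(v)^omega; v must be nonempty."""
    return u[i] if i < len(u) else v[(i - len(u)) % len(v)]


def same_omega_word(u1: str, v1: str, u2: str, v2: str) -> bool:
    """Decide u1(v1)^omega == u2(v2)^omega by comparing the first l symbols (Claim 1)."""
    bound = max(len(u1), len(u2)) + len(v1) * len(v2)
    return all(symbol_at(u1, v1, i) == symbol_at(u2, v2, i) for i in range(bound))
```

For instance, the pairs ("a", "b") and ("ab", "bb") denote the same ω-word, while ("a", "b") and ("", "ab") first differ at the third symbol.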


Let suffixes(u(v)^ω) denote the set of all ω-words that are suffixes of u(v)^ω.


Claim 2. The set suffixes(u(v)^ω) consists of at most |u| + |v| different examples:
one of the form u'(v)^ω for every nonempty suffix u' of u, and one of the form
(v_2 v_1)^ω for every division of v into a non-empty prefix and suffix as v = v_1 v_2.
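As an illustration of Claim 2, the suffixes can be enumerated directly on the pair representation. This is a sketch under the assumption that the rotation with an empty prefix, which yields (v)^ω itself, is counted among the |u| + |v| candidates; the function name `suffixes` mirrors the notation above but the code is not from the paper.

```python
def suffixes(u: str, v: str) -> list[tuple[str, str]]:
    """Pairs (u', v') representing the omega-word suffixes of u(v)^omega."""
    # One candidate u'(v)^omega for every nonempty suffix u' of u.
    result = [(u[i:], v) for i in range(len(u))]
    # One candidate (v2 v1)^omega per rotation of v; j = 0 gives (v)^omega itself.
    result += [("", v[j:] + v[:j]) for j in range(len(v))]
    return result
```

Different pairs in the returned list may denote the same ω-word, which is why the claim states an upper bound rather than an exact count.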


Identification in the limit using polynomial time and data We consider the
notion of identification in the limit using polynomial time and data. This criterion
of learning was introduced by [16], who showed that regular languages of finite
strings represented by DFAs are learnable in this sense. We follow a more
general definition given by [19]. The definition has two requirements: (1) a learning
algorithm A that runs in polynomial time on a set of labeled examples and
produces a hypothesis consistent with the examples, and (2) that for every
language L in the class, there exists a set T_L of labeled examples of size polynomial
in a measure of size of L such that on any set of labeled examples containing
T_L, the algorithm A outputs a hypothesis correct for L. Condition (1) ensures
polynomial time, while condition (2) ensures polynomial data. The latter is not
a worst-case measure; there could be arbitrarily large finite samples for which
A outputs an incorrect hypothesis. However, de la Higuera shows that
identifiability in the limit with polynomial time and data is closely related to a model
of a learner and a helpful teacher introduced by [17].


3 Negative Results 


We start with negative results. We show that when the representation at hand 
is non-deterministic, polynomial identification is not feasible. 


Theorem 3. The class NBA cannot be identified in the limit using polynomial 
data. 


Proof. The proof follows the idea given in the negative result for learning in the
limit NFAs from polynomial data [19]. For any integer M > 2, let p_1, ..., p_m be
the set of all primes less than or equal to M. For each such M, consider the NBA
B_M over a two letter alphabet Σ = {a, b} with p_1 + p_2 + ... + p_m + 2 states, where
state 0 has a-transitions to state (p, 1) for each p ∈ {p_1, p_2, ..., p_m}. State (p, i)
has an a-transition to state (p, i ⊕_p 1) where ⊕_p is addition modulo p. All states
except the states (p, 0) have a b-transition to state b. The state b has a self-loop
on b. The only accepting state is b. The NBA B_M for M = 5 is given in Fig. 1.

Fig. 1: The NBA B_M for M = 5.

The NBA B_M accepts the set of all words of the form a^k b^ω such that k is
not a positive multiple of ℓ = p_1 · p_2 ⋯ p_m. Note that the size of the shortest
ultimately periodic word in a*b^ω \ ⟦B_M⟧ is ℓ + 1, and thus, to distinguish the
language ⟦B_M⟧ from the language a*b^ω, a word of at least this size must be
provided. Since the number of primes not greater than M is Θ(M / log M) and
since each prime is of size at least 2 the data must be of size at least 2^{Θ(M / log M)}
while the number of states of B_M is O(M²).
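The gap between the number of states and the length of the shortest separating word can be checked numerically. The sketch below only computes the quantities named in the proof; the helper names `primes_up_to`, `nba_size_and_witness_length` and `accepts` are illustrative assumptions, and `accepts` encodes the acceptance behaviour described above rather than simulating the automaton.

```python
from math import prod


def primes_up_to(m: int) -> list[int]:
    """All primes p with 2 <= p <= m, by trial division."""
    return [p for p in range(2, m + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]


def nba_size_and_witness_length(m: int) -> tuple[int, int]:
    """Number of states of B_M and the length l + 1 of the shortest word a^l b^omega it rejects."""
    ps = primes_up_to(m)
    return sum(ps) + 2, prod(ps) + 1


def accepts(m: int, k: int) -> bool:
    """Whether B_M accepts a^k b^omega: k = 0, or some prime cycle is not at its state (p, 0)."""
    return k == 0 or any(k % p != 0 for p in primes_up_to(m))
```

For M = 5 this gives 12 states and a shortest rejected word of length 31, namely a^30 b^ω.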


Since NBAs are a special case of non-deterministic parity automata (NPA) 
and non-deterministic Muller automata (NMA) it follows that these models too 
cannot be identified in the limit using polynomial data. Note that indeed the 
NBA in the proof of Theorem 3 can be regarded as an NPA by setting the color 
of state b to 1 and the color of all other states to 0. Likewise it can be regarded 
as an NMA by defining the accepting set as {{b}}. 


Corollary 1. The classes NPA and NMA cannot be identified in the limit using 
polynomial data. 


While NBAs are not a special case of non-deterministic coBüchi automata
(NCA) it can be shown that NCA as well cannot be identified in the limit from
polynomial data, which is in some sense surprising, since NCAs are not more
expressive than DCAs, their deterministic counterpart, and accept a very small
subclass of the regular ω-languages.


Theorem 4. The class NCA cannot be identified in the limit using polynomial
data.


Proof. The proof is almost identical to that of Theorem 3. The only difference is
that it considers the automaton C_M that takes exactly the same form as B_M from
that proof but switching accepting and non-accepting states. Since C_M clearly
accepts the same language as that of B_M, with the same number of states, the
proof continues exactly the same.


4 Outline for the positive results 


The rest of the paper is devoted to the positive results. To show that a class is
identified in the limit using polynomial time and data there are two steps: (i)
constructing a sample of words T_L of size polynomial in the given acceptor M for
the language L at hand, the so-called characteristic sample, and (ii) providing a
learning algorithm that for every given sample T returns an acceptor consistent
with that sample, and in addition for any sample T that subsumes T_L returns
an acceptor that exactly recognizes L.

Since the construction of the characteristic sample is simpler we start with
that. We show that the classes IB, IC, IP and IM have characteristic samples
of size polynomial in the number of states of the acceptor, and that the
characteristic sample can be constructed in polynomial time. The definition of an
acceptor is composed of two steps: (a) the definition of the automaton and (b)
the definition of the acceptance condition. Some words are put in the sample
to help retrieve the automaton and some to help retrieve the acceptance
condition. We view the characteristic sample as a union of two parts T_Aut (for
retrieving the automaton) and T_Acc (for retrieving the acceptance condition).
The learning algorithm first constructs the automaton, then retrieves the
acceptance condition.

In Section 5 we discuss the construction of T_Aut which is common to all the
classes we consider, as they all are isomorphic to the automaton of the right
congruence. In Section 6 we show how an algorithm can retrieve the automaton
using the labeled words in T_Aut. In Section 7 we discuss the construction of
T_Acc that regards the acceptance condition of the DPA. This part is the most
involved one. We first associate with a DPA a canonical forest of its strongly
connected components. From this canonical forest we build the T_Acc part of the
characteristic sample. In Section 8 we show a learning algorithm that can retrieve
in polynomial time the acceptance condition of the DPA, from labeled examples
in T_Acc. This implies that IP (as well as its special cases IB and IC) can be
learned in the limit from polynomial time and data. In Section 9 we show that
the class IM can also be learned in the limit from polynomial time and data.


5 The characteristic sample for the automaton 


In this section we show how to construct the T_Aut part of the sample. We first
show that any two states that are distinguishable in the automaton are
distinguishable by words of length polynomial in the number of states.


5.1 Polynomial construction of short distinguishing words


Let M be an acceptor in one of the classes IB, IC, IP or IM with states Q over
alphabet Σ. If M is in one of the first three classes we use max{|Σ|, |Q|} for
its size measure. If M ∈ IM we use max{|Σ|, |Q|, m} for its size measure where
m is the number of sets in the acceptance condition α. We say that states q_1
and q_2 of M are distinguishable if there exists a word z ∈ Σ^ω that is accepted
from one but not the other (and that z is a distinguishing word). First we show
that any two distinguishable states of M are distinguishable by an ultimately
periodic word of size polynomial in M. Then we show how to use these words
to construct the T_Aut part of the characteristic sample.


Proposition 5. If two states of a DMA, DPA, DBA or DCA of n states are
distinguishable, then they are distinguishable by an ultimately periodic ω-word of
length bounded by n² + n⁴.


Proof. We prove that for a DMA M of n states, if two distinct states q_1 and
q_2 are distinguishable, then they are distinguishable by an ultimately periodic
ω-word of length bounded by n² + n⁴. Since any DPA, DBA or DCA is equivalent
to an isomorphic DMA, the above result holds also for DPAs, DBAs and DCAs.
Because q_1 and q_2 are distinguishable, there exists an ultimately periodic
ω-word x(y)^ω that is accepted from exactly one of the two states. For each
nonnegative integer k and i = 1, 2, let q_i(k) be the state visited after k symbols
of x(y)^ω have been read, starting with state q_i. Also, let C_i be the set of states
visited infinitely often by the sequence q_i(k), which determines the acceptance
or rejection of x(y)^ω from q_i. The sequence of pairs (q_1(k), q_2(k)) for k = 0, 1, ...
takes on at most n² different values. Let C be the set of pairs visited infinitely
often by this sequence. The two projections π_1(C) and π_2(C) are C_1 and C_2.
Let ℓ be the minimum value for which (q_1(k), q_2(k)) visits only pairs in C
for all k ≥ ℓ. Let x' be the prefix of x(y)^ω consisting of ℓ symbols. By removing
symbols between repeated pairs (q_1(k), q_2(k)) from x' we obtain a string u of
length at most n² that reaches the pair (q_1(ℓ), q_2(ℓ)) from (q_1(0), q_2(0)). Let m
be the minimum value for which (q_1(k), q_2(k)) for ℓ ≤ k ≤ m visits all the pairs
of C and returns to (q_1(ℓ), q_2(ℓ)), and let y' be the string from symbol ℓ to m − 1
of x(y)^ω. Distinguishing a subsequence of pairs that visits each element of C
once, we can remove from y' sequences of symbols between repeated pairs that
do not include a distinguished pair between them. Thus we obtain a string v
of length at most |C|n², that starts at (q_1(ℓ), q_2(ℓ)), visits all the distinguished
pairs and returns to the starting pair. Since |C| ≤ n², the length of u(v)^ω is at
most n² + n⁴. Also, since the set of states visited infinitely often on input u(v)^ω
from q_i is C_i we have that u(v)^ω is accepted from exactly one of q_1 and q_2.


For DPAs as well as DMAs there is a polynomial time algorithm to determine
whether two states are distinguishable and to find a distinguishing ω-word u(v)^ω
if they are. This result relies on a polynomial time algorithm to test the
equivalence of two DPAs or two DMAs and return an example u(v)^ω on which they
differ if not [9]. Since DBA and DCA are special cases of a DPA, a polynomial
construction of a distinguishing word applies to them as well.


5.2 Constructing the characteristic sample for the automaton


We now show how to construct the T_Aut part of the characteristic sample, given
an acceptor M in one of the classes IM, IP, IB or IC. Let n be the number of
states of M. We may assume that every state of M is reachable from the initial
state q_0. The algorithm constructs a set S of n access strings by breadth-first
search in the transition graph of M such that S is prefix-closed and contains,
for each state of M, exactly one string: the lexicographically least string of
shortest possible length reaching that state from the initial state. Using
Proposition 5, the algorithm may also construct a set E of at most n²
distinguishing experiments that contains for each pair q_1 and q_2 of distinct
states of M, an ω-word u(v)^ω of length at most n² + n⁴ that is accepted from
exactly one of the states q_1 and q_2.
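The access strings can be obtained by a breadth-first search that tries symbols in alphabet order. The sketch below assumes a complete deterministic transition table given as a dictionary from (state, symbol) to state; the function name `access_strings` and this encoding are illustrative choices, not the paper's.

```python
from collections import deque


def access_strings(alphabet, delta, q0):
    """Map each reachable state to its shortest, lexicographically least access string.

    alphabet: list of symbols in increasing order.
    delta: dict mapping (state, symbol) to the successor state.
    """
    access = {q0: ""}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in access:
                # First discovery happens via the least (access[q], a) pair.
                access[r] = access[q] + a
                queue.append(r)
    return access
```

The set of values of the returned dictionary is prefix-closed, since every string other than the empty one extends the access string of the state from which it was discovered.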

Part one of the sample, T_Aut, consists of all the examples in (S · E) ∪ (S · Σ · E),
labeled to be consistent with M. There are at most (1 + |Σ|)n³ labeled examples
in T_Aut, each of length bounded by a polynomial in n. This information is enough
to allow the polynomial time learning algorithm to reconstruct a transition graph
isomorphic to that of M.


Proposition 6. Let M' be any deterministic automaton that is consistent with
the sample T_Aut. Then M' has at least n states and if M' has exactly n states
then M' and M have isomorphic transition graphs.


Proof. The states of M' reached from the initial state by the access strings in
S must all be distinct, because for any pair of different strings s_1, s_2 ∈ S, there
exists a word u(v)^ω ∈ E such that s_1 · u(v)^ω and s_2 · u(v)^ω have different labels
in T_Aut. Thus M' must have at least n distinct states.

Assume that M' has exactly n states. Given the state q of M' reached by
some s ∈ S and a symbol σ ∈ Σ, the labeled examples s · σ · u(v)^ω in T_Aut for
all u(v)^ω ∈ E uniquely determine which string s' ∈ S corresponds to the state
reached in M' from q on input symbol σ. Thus the transition graph of M' is
isomorphic to the transition graph of M.


6 Learning the automaton 


Let L denote the language to be learned, and M denote an acceptor of n states
that is isomorphic to its right congruence automaton and recognizes L. Let the
input sample of labeled examples be T. We now describe a learning algorithm
A that makes use of the information in the given sample T to construct an
automaton. If T subsumes T_Aut the returned automaton will be isomorphic to
the acceptor M.

From the sample T, the algorithm constructs as follows a set E of strings
that serve as experiments used to distinguish states. For each labeled example
(u(v)^ω, l) in T, all of the elements of suffixes(u(v)^ω) are placed in E. Thus if the
sample T includes T_Aut, then for any pair of states of M the set E includes an
experiment that distinguishes them.

Starting with the empty string ε, the algorithm attempts to build up a
prefix-closed set S of finite strings that reach different states of M from the initial state.
Initially, S_1 = {ε}. After S_k has been constructed, the algorithm attempts to
determine, for each s ∈ S_k and each symbol σ ∈ Σ in the ordering defined on Σ,
whether s · σ reaches the same state as some string already in S_k or a new state.
If for each string s' in S_k, there exists some u(v)^ω ∈ E such that the sample
T has different labels for s · σ · u(v)^ω and s' · u(v)^ω, then this is evidence that
s · σ reaches a new state, and S_{k+1} is set to S_k ∪ {s · σ}. If no such pair s and
σ is found, then the final set S is S_k. Because M has only n states, this case
is reached with k ≤ n. If the sample T subsumes T_Aut then this process will
discover exactly the strings reaching all n states of M used in the construction
of T_Aut; otherwise, it may terminate early.

In the second phase, the algorithm uses the strings in S as names for states
and constructs a transition function δ' using S and E. For each s ∈ S and σ ∈ Σ,
we know that there is at least one s' ∈ S such that there is no u(v)^ω ∈ E for
which s · σ · u(v)^ω and s' · u(v)^ω have different labels in T (possibly because one
or more of these examples are not in T at all). The algorithm selects one such
s' and defines δ'(s, σ) = s'. If the strings in S actually reach all the states of
M and the choice of s' is unique in each case, then δ' will be isomorphic to the
transition function of M. This will be the case if the sample T includes T_Aut
because then among the elements of E will be experiments that distinguish any
pair of states of M; otherwise, δ' may not be correct.
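The two phases can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the sample is assumed to be a dictionary from pairs (u, v) to labels, examples are looked up through a canonical form computed by the hypothetical helper `normalize`, and in the second phase the first candidate in S is selected when several exist.

```python
def normalize(u, v):
    """Canonical pair for u(v)^omega: primitive period and shortest possible prefix."""
    for d in range(1, len(v) + 1):
        if len(v) % d == 0 and v[:d] * (len(v) // d) == v:
            v = v[:d]
            break
    while u and u[-1] == v[-1]:
        u, v = u[:-1], v[-1] + v[:-1]
    return u, v


def suffixes(u, v):
    """Pairs representing the omega-word suffixes of u(v)^omega (Claim 2)."""
    return [(u[i:], v) for i in range(len(u))] + [("", v[j:] + v[:j]) for j in range(len(v))]


def learn_automaton(sample, alphabet):
    """sample: dict {(u, v): label}; alphabet: list of symbols in the chosen order."""
    labels = {normalize(u, v): l for (u, v), l in sample.items()}
    experiments = {normalize(*e) for (u, v) in sample for e in suffixes(u, v)}

    def label(s, e):
        # Label of the example s·e in the sample, or None if it is absent.
        return labels.get(normalize(s + e[0], e[1]))

    def distinguished(s, t):
        return any(
            label(s, e) is not None and label(t, e) is not None and label(s, e) != label(t, e)
            for e in experiments
        )

    # Phase 1: grow the prefix-closed set S of access strings.
    S = [""]
    grew = True
    while grew:
        grew = False
        for s in list(S):
            for a in alphabet:
                if s + a not in S and all(distinguished(s + a, t) for t in S):
                    S.append(s + a)
                    grew = True
                    break
            if grew:
                break

    # Phase 2: for each s and a, select some state not distinguished from s·a.
    delta = {(s, a): next(t for t in S if not distinguished(s + a, t))
             for s in S for a in alphabet}
    return S, delta
```

On a sample for the language of ω-words that start with a, containing the examples in (S · E) ∪ (S · Σ · E) for S = {ε, a, b} and E = {a^ω, b^ω}, the sketch recovers the three states ε, a and b.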


7 Characteristic sample for a DPA 


The construction of T_Acc, the part of the characteristic sample used for retrieving
the accepting condition of a DPA, builds on the construction of a forest of SCCs
associated with a given DPA, which we term the canonical forest. Its properties
and its construction are described next.


7.1 Constructing the canonical forest of a DPA 


We start with some definitions and simple claims. Let P = (Σ, Q, q_0, δ, κ) be
a deterministic parity acceptor (DPA). A set of states C ⊆ Q is a strongly
connected component (SCC) if and only if C is nonempty and for every q_1, q_2 ∈ C,
there exists a nonempty string v ∈ Σ^+ such that δ(q_1, v) = q_2 and for all
prefixes u of v, δ(q_1, u) ∈ C. Note that an SCC need not be maximal, and that a
singleton {q} is an SCC if and only if the state q has a self-loop, that is,
δ(q, σ) = q for some σ ∈ Σ. For any ω-word w, the set C of states visited infinitely
often in the run of P on input w is an SCC of P.


Claim 7. If C_1 and C_2 are SCCs of P and C_1 ∩ C_2 ≠ ∅, then C_1 ∪ C_2 is also
an SCC of P.


If P is a DPA and R ⊆ Q is any set of states, define SCCs(R) to be the set
of all C such that C ⊆ R and C is an SCC of P. Also define maxSCCs(R) to
be the maximal elements of SCCs(R) with respect to the subset ordering.


Claim 8. If P is a DPA and R ⊆ Q is any set of states, then the elements of
maxSCCs(R) are pairwise disjoint, and every set C ∈ SCCs(R) is a subset of
exactly one element of maxSCCs(R).


If P is a DPA, we extend its coloring function κ to any nonempty set R of
states by κ(R) = min{κ(q) | q ∈ R}. We define the parity of R to be 1 if κ(R) is
odd, and 0 otherwise. For an ω-word w, if the SCC C is the set of states visited
infinitely often in the run of P on w, then w is accepted by P iff the parity of
C is 1. Note that the union of two sets of parity b is also of parity b. For any
set of states R ⊆ Q, we define minStates(R) to be the set of states q ∈ R such
that κ(q) = κ(R), that is, the states of R that are assigned the minimum color
among all states of R.
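These three notions translate directly into code. The sketch below assumes the coloring is given as a dictionary `kappa` from states to integers; the function names are illustrative.

```python
def color_of(R, kappa):
    """kappa(R): the minimum color among the states of the nonempty set R."""
    return min(kappa[q] for q in R)


def parity(R, kappa):
    """1 if kappa(R) is odd, 0 otherwise."""
    return color_of(R, kappa) % 2


def min_states(R, kappa):
    """The states of R that carry the minimum color kappa(R)."""
    m = color_of(R, kappa)
    return {q for q in R if kappa[q] == m}
```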


The Canonical Forest Using these definitions we can show that there exists 
a forest associated with a DPA that has the following interesting properties. We 
provide an example for a canonical forest for a given DPA at the end of the 
current subsection. 


Theorem 9. Let P = (Σ, Q, q_0, δ, κ) be a DPA. There exists a canonical forest
F*(P) that is unique up to isomorphism and has the following properties.


1. There are at most |Q| nodes in F*(P), each one a distinct SCC of P.

2. The root nodes of F*(P) are the elements of maxSCCs(Q).

3. The children of a node C of parity b are the maximal SCCs C' ⊆ C of parity
1 − b.

4. The children of a node C are pairwise disjoint and their union is a proper
subset of C.

5. For any SCC D of P, there is a unique node C in F*(P) such that D ⊆ C
and D is not a subset of any of the children of C, and C and D have the
same parity.


Proof. The root nodes of F*(P) are the elements of maxSCCs(Q) and are SCCs
that are pairwise disjoint, by Claim 8. Let C be one of them, and assume its
parity is b. Let T be the set of SCCs that are subsets of C and of parity 1 − b.
If T = ∅ then C has no children and is a leaf of F*(P). Otherwise, the children
of C are the maximal elements of T with respect to the subset ordering. The
children of C must be pairwise disjoint because if they share a state, then their
union is an SCC contained in C of parity 1 − b and is a proper superset of at
least one of them, violating maximality. No child of C can contain an element
of minStates(C) because otherwise the parity of the child would be b. Thus
the union of the children of C must be a proper subset of C. These conditions
imply that there are at most |Q| nodes in the forest, and that it is unique up to
isomorphism.

Let D be any SCC of P. Then D ∈ SCCs(Q), so by Claim 8, because the
roots of F*(P) are the elements of maxSCCs(Q), there is a unique root node C_0
such that D ⊆ C_0. Suppose the parity of C_0 is b. If D is not a subset of any
of the children of C_0, then it cannot have parity 1 − b, so the choice C = C_0
satisfies the required condition. If, however, D is a subset of some child C_1 of
C_0, then because the children of C_0 are pairwise disjoint, C_1 is the only child
of C_0 that contains D. Again, if D is not a subset of any of the children of C_1,
then D and C_1 must have the same parity, and the choice C = C_1 satisfies the
condition. Otherwise, we continue down the tree rooted at C_0 until a node C is
found that satisfies the condition. Note that if we arrive at a leaf C_k, then D is
not a subset of any of the children of C_k (there are none) and D must have the
same parity as C_k because otherwise C_k would have at least one child.


The Canonical Coloring The canonical forest F*(P) allows us to define a
canonical coloring κ* for P, as follows. The states in Q \ ⋃ maxSCCs(Q) are
not contained in any SCC of P and do not affect the acceptance or rejection
of any ω-word. For definiteness, we assign them κ*(q) = 0. For each node C of
F*(P), we define A(C) to be the set of states of C that are not contained in the
union of the children of C. For a root node C of parity b, we define κ*(q) = b for
all q ∈ A(C). Let C be an arbitrary node of F*(P). If the states of A(C) have
been assigned color k by κ* and D is a child of C, then the states of A(D) are
assigned color k + 1 by κ*. We observe that if q_1 ∈ A(C) and q_2 is in a child of
C, then κ*(q_1) < κ*(q_2), and κ*(q_1) is of the same parity as C.
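The coloring rule can be sketched as a top-down traversal. The representation is an assumption made for illustration: the forest is a list of pairs (C, children) with C a frozenset of states, and `root_parity` maps each root to its parity.

```python
def canonical_coloring(states, forest, root_parity):
    """Assign colors from a forest given as a list of (C, children) pairs."""
    # States outside every SCC get color 0 and keep it.
    color = {q: 0 for q in states}

    def assign(C, children, k):
        covered = set().union(*(D for D, _ in children))
        for q in C - covered:
            # q belongs to C but to none of its children.
            color[q] = k
        for D, grandchildren in children:
            assign(D, grandchildren, k + 1)

    for C, children in forest:
        assign(C, children, root_parity[C])
    return color
```

Because each level adds one to the color, the color of a state in a node has the same parity as that node whenever parities alternate between parent and child, as they do in the canonical forest.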


Theorem 10. Let P = (Σ, Q, q_0, δ, κ) be a DPA, and P' be P with the canonical
coloring κ* for P in place of κ. Then P and P' recognize the same ω-language.


Proof. Let w be an ω-word and let D be the SCC consisting of the states visited
infinitely often in the run of P (and also of P') on input w. Let C be the unique
node of F*(P) such that D is a subset of C and is not a subset of any of the
children of C. Thus D contains at least one q ∈ A(C). In P the parity of D is
the same as the parity of C, which is the same as the parity of κ*(q), which is
equal to the parity of D in P'. Thus either both P and P' accept w or both
reject w.


Computing the Canonical Forest We now show that, given a DPA P =
(Σ, Q, q_0, δ, κ), we can compute the canonical forest of P in polynomial time.
We first define a (possibly non-canonical) forest F_κ(P) using the given coloring
κ. The root nodes are the elements of maxSCCs(Q), the set of all maximal SCCs
of P. Once we have defined a node C of the forest, the children are the elements
of the set maxSCCs(C \ minStates(C)), that is, the maximal SCCs contained
in C with the set of states of minimum color removed. If this set is empty, the
node has no children and is a leaf. Note that in contrast to the case of the
canonical forest, in F_κ(P) the children of a node are not constrained to be of
parity opposite to that of the parent.

By construction each node in the forest F_κ(P) is an SCC of P. If D is a
descendant of C in the forest, then D is a proper subset of C, and κ(C) < κ(D).
Because the roots are pairwise disjoint and the children of any node are pairwise
disjoint, the sets minStates(C) for nodes C in the forest are pairwise disjoint and
nonempty, so there are at most |Q| nodes. Because a linear time algorithm for
computing strongly connected components can be used to compute the children
of a node, the forest F_κ(P) may be computed in polynomial time in the size of
the given DPA P.

Fig. 2: (a) Transition graph of DPA P with states colored by κ. (b) Non-canonical forest F_κ(P),
with parities of nodes. (c) Canonical forest F*(P), with parities of nodes. (d) Transition graph of P
with the canonical coloring κ*.

To obtain the canonical forest F*(P) from the possibly non-canonical forest
F_κ(P), we may repeatedly merge pairs of adjacent nodes of the same parity until
every pair of adjacent nodes are of different parity. That is, if C is a node of
parity b and D is a child of C of parity b, then D ⊂ C, and we merge D into C
by deleting D and making all the children of D direct children of C. Repeating
this operation until there are no parent/child pairs of equal parity yields the
canonical forest F*(P). This computation can be done in polynomial time.
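The two-step computation can be sketched as follows. For brevity the sketch finds maximal SCCs by pairwise reachability rather than by a linear-time SCC algorithm, which is still polynomial; the graph encoding (`succ` maps each state to a list of successors) and all function names are illustrative assumptions.

```python
def max_sccs(R, succ):
    """Maximal SCCs inside R; a singleton counts only if it has a self-loop."""
    R = set(R)

    def reach(q):
        # States reachable from q in at least one step without leaving R.
        seen, stack = set(), [q]
        while stack:
            for r in succ[stack.pop()]:
                if r in R and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    reach_map = {q: reach(q) for q in R}
    return {frozenset(r for r in reach_map[q] if q in reach_map[r])
            for q in R if q in reach_map[q]}


def forest_kappa(R, succ, kappa):
    """The possibly non-canonical forest: list of (C, children) pairs."""
    nodes = []
    for C in max_sccs(R, succ):
        m = min(kappa[q] for q in C)
        rest = {q for q in C if kappa[q] != m}  # C without minStates(C)
        nodes.append((C, forest_kappa(rest, succ, kappa)))
    return nodes


def merge_same_parity(nodes, kappa):
    """Delete every child whose parity equals its parent's, lifting its children."""
    out = []
    for C, children in nodes:
        b = min(kappa[q] for q in C) % 2
        merged = []
        for D, grandchildren in merge_same_parity(children, kappa):
            if min(kappa[q] for q in D) % 2 == b:
                merged.extend(grandchildren)  # these have parity 1 - b
            else:
                merged.append((D, grandchildren))
        out.append((C, merged))
    return out
```

Processing the forest bottom-up means that the children lifted from a deleted node already have parity opposite to it, and hence opposite to the new parent, so a single pass suffices.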

Note that to obtain a canonical forest for a given DBA (resp., DCA) we can
simply first color states in F by 1 (resp., 0) and in Q \ F by 2 (resp., 1) and then
compute the canonical forest for the resulting DPA. In both cases the canonical
forest will be of depth at most two, since in DBA an accepting SCC cannot be
subsumed by a rejecting SCC (and vice versa in DCA).


An Example Figure 2(a) shows the transition graph of an example DPA P with
states a through m, labeled by the colors assigned by κ. There is a directed edge
from state q_1 to state q_2 if there exists a symbol σ ∈ Σ such that δ(q_1, σ) = q_2.
Figure 2(b) shows the non-canonical SCC forest F_κ(P) of P, with the nodes
labeled by their parities. Figure 2(c) shows the canonical SCC forest F*(P) of
P, with the nodes labeled by their parities. Figure 2(d) shows the transition
graph of P re-colored using the canonical coloring κ*.


7.2 Constructing the characteristic sample for a DPA 

We can now construct T_Acc, the second part of the characteristic sample for a
DPA P. The sample T_Acc consists of one example u(v)^ω for each node C of the
canonical forest F*(P), where u is a string that reaches a state q in C from the
initial state q_0, and v is a nonempty string that, starting from q, visits every
state of C and no state outside of C and returns to q. The length of the example
u(v)^ω can be taken to be bounded by n + n². The example u(v)^ω is labeled 1
if it is accepted by P and otherwise is labeled 0. Then T_Acc contains at most
n labeled examples, each of length polynomial in n. The final characteristic
sample for L = ⟦P⟧ is T_L = T_Aut ∪ T_Acc. The sample T_L contains O(|Σ|n³)
labeled examples, each of length at most O(n⁴), which is polynomial in size(L).


8 The learning algorithm for a DPA 


We can now describe the learning algorithm A that makes use of the
information in T_L. Similar to Gold's construction, the algorithm optimistically assumes
that the sample includes a characteristic sample, and if that assumption fails to
produce an acceptor consistent with the sample, the algorithm defaults to
producing a table-lookup acceptor to ensure that its hypothesis is consistent with
the sample. The algorithm we describe is sufficient to establish the theoretical
results, but for practical applications much more effort should be expended to
find good heuristic choices to avoid defaulting too easily.

Let L denote the language to be learned, and P denote a DPA of n states
that is isomorphic to its right congruence automaton and recognizes L. The
first and second phases of the algorithm are as described in Section 6: in the
first phase the algorithm builds the set S of states of the automaton, and in the
second phase it builds the transition function δ'. In the third phase, the acceptance
condition (namely the coloring) is determined. In this phase, the algorithm may
default to returning the table-lookup DPA for T. We first explain the construction
of the table-lookup DPA, then describe the third phase.


A table-lookup DPA A table-lookup DPA for a given sample T is constructed by
finding the shortest prefix of each example u(v)^ω in T that distinguishes it from
all other examples in T and placing these prefixes in a trie-like structure. At
each leaf of the trie is a structure accepting (or rejecting, depending on the
label of the example) the appropriate suffix of the unique example that arrives
at that leaf. By Claim 1, this DPA can be constructed in time polynomial in the
length of the sample T. Note that this construction is easily modified to give a
DBA, DCA or DMA instead of a DPA. As an example, for the sample
T = {(a(b)^ω, 1), ((ab)^ω, 1), (ab(baa)^ω, 0)}, the corresponding prefixes are abbb,
aba, and abba, and the table-lookup DPA for T is shown in Figure 3, with states
labeled by colors 0 and 1.

Fig. 3: Table-lookup DPA for T = {(a(b)^ω, 1), ((ab)^ω, 1), (ab(baa)^ω, 0)}.
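The prefix computation for the trie can be sketched as below, using the bound of Claim 1 to locate the first position where two examples differ. The sketch assumes the pairs denote pairwise distinct ω-words, as holds in a sample after identifying equal examples; the function names are illustrative.

```python
def symbol_at(u, v, i):
    """Symbol i (0-based) of the omega-word u(v)^omega."""
    return u[i] if i < len(u) else v[(i - len(u)) % len(v)]


def distinguishing_prefixes(examples):
    """Shortest prefix of each example that no other example in the list shares."""
    prefixes = []
    for (u, v) in examples:
        need = 0
        for (u2, v2) in examples:
            if (u2, v2) == (u, v):
                continue
            bound = max(len(u), len(u2)) + len(v) * len(v2)
            # Claim 1: distinct omega-words differ within the first `bound` symbols.
            first_diff = next(i for i in range(bound)
                              if symbol_at(u, v, i) != symbol_at(u2, v2, i))
            need = max(need, first_diff + 1)
        prefixes.append("".join(symbol_at(u, v, i) for i in range(need)))
    return prefixes
```

On the three examples of the sample above, given as the pairs ("a", "b"), ("", "ab") and ("ab", "baa"), this returns abbb, aba and abba, matching the prefixes stated in the text.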


Determining the coloring In the third phase, the algorithm attempts to
define a coloring of the states in S. The algorithm constructs the set Z of all
subsets C of S such that for some labeled example (u(v)^ω, l) in T, the subset C is
the set of elements of S that are visited infinitely often in the run on input u(v)^ω
starting at ε using the transition function δ'. If in this process two examples with
different labels are found to yield the same set C, the learning algorithm defaults
to the table-lookup DPA for T. Otherwise, each set C in Z is associated with
the label of the example(s) that yield C. The set Z is partially ordered by the
subset relation. The learning algorithm then attempts to construct a forest F'
with nodes that are elements of Z, corresponding to the canonical forest of P.
Initially, F' contains as roots all the maximal elements of Z. If these are not
pairwise disjoint, it defaults to the table-lookup DPA for T. Otherwise, for each
unprocessed element C in F', it computes the set of all D ∈ Z such that D ⊂ C,
D has the opposite label to C, and D is maximal with these properties, and
makes D a child of C. When all the children of a node C have been determined,
the algorithm checks two conditions: (1) that the children of C are pairwise
disjoint, and (2) there is at least one s ∈ C that is not in any child of C. If either
of these conditions fails, then it defaults to the table-lookup DPA for T. If both
conditions are satisfied, then the node C is marked as processed. When there
are no more unprocessed nodes, the construction of F' is complete. Note that
F' can have at most n nodes, because S has at most n elements.
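The basic operation in this phase is computing the set of states visited infinitely often on an ultimately periodic input. A minimal sketch, assuming the transition function is a dictionary from (state, symbol) to state and using the hypothetical name `inf_states`:

```python
def inf_states(delta, start, u, v):
    """States visited infinitely often in the run on u(v)^omega from `start`."""
    q = start
    for a in u:
        q = delta[(q, a)]
    # Record the state at the start of each copy of v until one repeats.
    index, boundary = {}, []
    while q not in index:
        index[q] = len(boundary)
        boundary.append(q)
        for a in v:
            q = delta[(q, a)]
    # The boundary states from the first occurrence of q onward repeat forever.
    inf = set()
    for p in boundary[index[q]:]:
        for a in v:
            p = delta[(p, a)]
            inf.add(p)
    return inf
```

Since there are at most n distinct boundary states, the loop reads at most n + 1 copies of v before a repetition, so the computation is polynomial in n and the length of the example.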

When the construction of F' completes, for each node C in F' let A(C)
denote the elements of C that do not appear in any of its children. Then the
learning algorithm assigns colors to the elements of S starting from the roots
of F', as follows. If C is a root with label l, then κ'(s) = l for all s ∈ A(C). If
the elements of A(C) have been assigned color k and D is a child of C, then
κ'(s) = k + 1 for all s ∈ A(D). When this process is complete, any uncolored
strings s are assigned κ'(s) = 0. If the resulting DPA P' is consistent with the
sample T, the learning algorithm outputs P' and halts. If the sample T includes
both T_Aut (to specify the automaton) and T_Acc (to specify the coloring), then
F' will be isomorphic to the canonical forest F*(P) and κ' will correspond to
the canonical coloring κ*, and P' will recognize the target language L.

If the process described above does not result in a DPA that is consistent 
with the sample T, then the algorithm defaults to constructing the table-lookup 
DPA for T. 

The learning algorithm also works for the classes IB and IC. In the case of IB 
and IC we need to define a set F rather than a coloring κ. After constructing 
the forest, the set F is determined to contain the states in the root nodes that 
are not in the leaves. Thus we have the following. 

Theorem 11. The classes IB, IC and IP are identifiable in the limit using 
polynomial time and data. Moreover, characteristic samples can be computed in 
polynomial time. 


A corollary of Theorem 11 is that the class of languages recognized by 
deterministic weak parity acceptors (DWPA), which was shown to be polynomially 
learnable using membership and equivalence queries in [24], is identifiable in the 
limit using polynomial time and data. This class (which is equivalent to the 
intersection of classes DBA ∩ DCA) was shown to be a subset of IM in [30], and 
to be a subset of IP in [4]. 


Corollary 2. The class DWPA is identifiable in the limit using polynomial time 
and data. Moreover, characteristic samples can be computed in polynomial time. 


9 The sample T_Acc and the learning algorithm for a DMA 


The above results can be extended to the class IM. Recall that we define the 
size measure for a DMA to be max{|Σ|, |Q|, m}, where m is the number of sets 
in the acceptance condition. For the characteristic sample, T_Aut remains the 
same, but T_Acc contains, for each accepting set C, an example u(v)^ω for which 
C is the set of states visited infinitely often. In the learning algorithm, the 
construction of the transition function remains the same. Instead of attempting 
to construct a coloring function, the learning algorithm finds, for each labeled 
example (u(v)^ω, 1) ∈ T, the set C of states s that are visited infinitely often 
on input u(v)^ω starting from ε and using the transition function δ', and adds 
C to the acceptance condition. If the construction does not result in a DMA 
consistent with T, then it defaults to producing a table-lookup DMA for T. 
Because, in addition, as stated in Section 5.1, a characteristic sample can be 
computed in polynomial time, we have the following. 


Theorem 12. The class IM is identifiable in the limit using polynomial time 
and data. Moreover, a characteristic sample can be computed in polynomial time. 
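
The central step of the DMA construction, finding the set of states visited infinitely often on an ultimately periodic word u(v)^ω, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the transition function is assumed to be a Python dictionary from (state, symbol) pairs to states, and the function name is hypothetical.

```python
def states_visited_infinitely_often(delta, initial, u, v):
    """Return the set of states visited infinitely often on input u(v)^omega.

    delta maps (state, symbol) pairs to states; the period v must be non-empty.
    """
    if not v:
        raise ValueError("the period v must be non-empty")

    # Read the finite prefix u.
    state = initial
    for symbol in u:
        state = delta[(state, symbol)]

    # Read copies of v until the state at the start of a copy repeats.
    seen_at = {}   # state at the start of a copy of v -> index of that copy
    entered = []   # entered[i] = states entered while reading the i-th copy
    while state not in seen_at:
        seen_at[state] = len(entered)
        during_copy = set()
        for symbol in v:
            state = delta[(state, symbol)]
            during_copy.add(state)
        entered.append(during_copy)

    # From the first occurrence of the repeated state on, the run is periodic.
    result = set()
    for during_copy in entered[seen_at[state]:]:
        result |= during_copy
    return frozenset(result)


# Hypothetical three-state automaton over the alphabet {a, b}.
delta = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 1,
    (2, "a"): 1, (2, "b"): 2,
}
inf_set = states_visited_infinitely_often(delta, 0, "a", "a")
```

In the learning algorithm, one such set would be computed for each positively labeled example and added to the acceptance condition. Since the state at the start of a copy of v must repeat after at most |Q| copies, the loop reads at most |Q| copies of v.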


10 Discussion 


We have shown that the non-deterministic classes of ω-automata NBA, NPA, 
NMA and NCA cannot be identified in the limit using polynomial data. A negative 
result regarding query learning of the first three classes was recently obtained 
in [3]. That result makes a plausible assumption of cryptographic hardness, which 
is not required here. On the positive side, we have shown that the classes IB, IC, 
IP and IM can be identified in the limit using polynomial time and data and, 
moreover, that a characteristic sample can be constructed in polynomial time. The 
construction builds on the definition of a canonical forest for a DPA, which may 
be of use in other contexts as well. The question whether the deterministic classes 
DBA, DPA, DMA and DCA can be polynomially learned in the limit remains 
open. 

Abstract. This report describes the 2020 Competition on Software Verification 
(SV-COMP), the 9th edition of a series of comparative evaluations 
of fully automatic software verifiers for C and Java programs. The compe- 
tition provides a snapshot of the current state of the art in the area, and 
has a strong focus on replicability of its results. The competition was based 
on 11052 verification tasks for C programs and 416 verification tasks 
for Java programs. Each verification task consisted of a program and a 
property (reachability, memory safety, overflows, termination). SV-COMP 
2020 had 28 participating verification systems from 11 countries. 

1 Introduction 


The Competition on Software Verification (SV-COMP) serves as the showcase of 
the state of the art in the area of automatic software verification. SV-COMP 2020 
is the 9th edition of the competition and presents an overview of the currently 
achieved results by tool implementations that are based on the most recent ideas, 
concepts, and algorithms for fully automatic verification. This competition report 
describes the (updated) rules and definitions, presents the competition results, 
and discusses some interesting facts about the execution of the competition 
experiments. The competition measures its own success by evaluating whether 
the objectives of the competition were achieved. To the objectives discussed 
earlier (1-4 [14]) we add two further objectives that deserve mentioning (5-6): 


1. provide an overview of the state of the art in software-verification technology 
and increase visibility of the most recent software verifiers, 

2. establish a repository of software-verification tasks that is publicly available 
for free use as standard benchmark suite for evaluating verification software, 

3. establish standards that make it possible to compare different verification 
tools, including a property language and formats for the results, 

4. accelerate the transfer of new verification technology to industrial practice 
by identifying the strengths of the various verifiers on a diverse set of tasks, 

5. educate PhD students and others on performing replicable benchmarking, 
packaging tools, and running robust and accurate research experiments, and 

6. provide research teams that do not have sufficient computing resources with 
the opportunity to obtain experimental results on large benchmark sets. 


© The Author(s) 2020 
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 347–367, 2020. 
https://doi.org/10.1007/978-3-030-45237-7_21 

We now discuss the outcome of SV-COMP 2020 with respect to these objectives: 
(1) There were 28 participating software systems from 11 countries, using 
many different technologies (cf. Table 6). SV-COMP is considered an important 
event in the verification community. (2) The sv-benchmarks repository is consid- 
ered one of the largest and most diverse collections of verification tasks in C and 
Java. The community dedicates a lot of maintenance effort, as the issue tracker¹ 
and the pull requests² on GitHub show. (3) SV-COMP has established a format 
for defining verification tasks, a standard specification language, and a set of 
functions to express non-deterministic values. Verification results are validated 
using verification witnesses and six different validators. (4) We received positive 
feedback from industry, reporting that it is helpful to look up the newest and best 
available verification tools, regarding the categories of interest. Several systems 
from industry have participated since 2017. (5) Participating in SV-COMP 
is also a challenge because the entry requirements are strict: the tools have to 
be packaged such that all necessary non-standard components are contained, 
the tools need to provide meaningful log output, the tool parameters have to be 
specified in the BenchExec benchmark-definition format, and a tool-info module 
needs to be implemented. All experiments are required to be fully replicable. 
It is a motivating experience to observe the learning of first-time participants. 
(6) Running large-scale performance experiments requires an infrastructure with 
considerable computing resources — which are not necessarily available to all 
tool developers. Through this competition and the preruns, the participants get 
the opportunity to repeatedly run experiments on the full benchmark set of 
verification tasks of the competition. The preruns and final run sum up to over 
one million verification runs and ten million witness-validation runs. 


Related Competitions. It is well understood that competitions are an important 
evaluation method, and there are many other competitions in the field of 
formal methods. The TOOLympics³ [7] event in 2019 (part of the 25-years-of-TACAS 
celebration) presented 16 competitions in the area. Most closely related 
are the competitions RERS⁴ [45] and VerifyThis⁵ [46]. While SV-COMP⁶ performs 
replicable experiments in a controlled environment (dedicated resources, 
resource limits), the RERS Challenges give more room for exploring combinations 
of interactive with automatic approaches without limits on the resources, 
and the VerifyThis Competition focuses on evaluating approaches and ideas 
rather than on fully automatic verification. 

Large benchmark collections are extremely important to make approaches 
comparable and to agree on what constitutes interesting problems to solve. 
There are other large benchmark collections as well (e.g., by SPEC⁷), but the 


1 https://github.com/sosy-lab/sv-benchmarks/issues 
2 https://github.com/sosy-lab/sv-benchmarks/pulls 
3 https://tacas.info/toolympics.php 
4 http://rers-challenge.org 
5 http://etaps2016.verifythis.org 
6 https://sv-comp.sosy-lab.org 
7 https://www.spec.org 
sv-benchmarks suite⁸ is (a) free of charge, and (b) tailored to the state of the 
art in software verification. Benchmark repositories of various competitions and 
challenges also contribute to each other. For example, the sv-benchmarks suite 
contains programs that were originally used in RERS⁹, in termCOMP¹⁰, and 
in VerifyThis¹¹. There is a flow of benchmarks in the other direction as well: 
The competition SMT-COMP [32] uses SMT formulas that were generated from 
programs of the sv-benchmarks collection. For example, the k-induction engine 
of CPAchecker was used to generate more than 1000 SMT formulas for the 
quantifier-free theory of arrays and bit-vectors (QF_ABV)¹². 


2 Organization, Definitions, Formats, and Rules 


Procedure. SV-COMP 2020's overall organization did not change in comparison 
to the earlier editions [8, 9, 10, 11, 12, 13, 14]. SV-COMP is an open competition, 
where all verification tasks are known before the submission of the participating 
verifiers, which is necessary due to the complexity of the C language. During the 
benchmark submission phase, new verification tasks were collected, classified, and 
added to the existing benchmark suite (i.e., SV-COMP uses an accumulating 
benchmark suite). During the training phase, the teams inspected the verification 
tasks and trained their verifiers (the verification tasks also received fixes and 
quality improvements). During the evaluation phase, verification runs were 
performed with all competition candidates, and the system descriptions and 
archives were reviewed by the competition jury. The participants received the 
results of their verifier directly via e-mail, and after a few days of inspection, the 
results were publicly announced on the competition web site. The Competition 
Jury consisted again of the chair and one member of each participating team. 
Team representatives of the jury are listed in Table 5. 


Qualification and License Requirements. As a new feature in SV-COMP 
2020, a rule was introduced that allows the organizer to reuse systems that 
participated in previous years, and to enter new systems, provided that the 
developers were given the chance to contribute a submission themselves (neither 
option was used this time). Since 2018, SV-COMP has required that each 
verifier be publicly available for download and have a license that 


(i) allows replication and evaluation by anybody (including results publication), 
(ii) does not restrict the usage of the verifier output (log files, witnesses), and 
(iii) allows any kind of (re-)distribution of the unmodified verifier archive. 


8 https://github.com/sosy-lab/sv-benchmarks 
9 https://github.com/sosy-lab/sv-benchmarks/blob/svcomp20/c/eca-rers2012/README.txt 
10 https://github.com/sosy-lab/sv-benchmarks/blob/svcomp20/c/termination-restricted-15/README.txt 
11 https://github.com/sosy-lab/sv-benchmarks/blob/svcomp20/c/verifythis/README.txt 
12 https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks-inc/QF_ABV/tree/master/20190307-CPAchecker_kInduction-SoSy_Lab 

 1  format_version: '1.0' 
 2 
 3  # old file name: floppy_true-unreach-call_true-valid-memsafety.i.cil.c 
 4  input_files: 'floppy.i.cil-3.c' 
 5 
 6  properties: 
 7    - property_file: ../properties/unreach-call.prp 
 8      expected_verdict: true 
 9    - property_file: ../properties/valid-memsafety.prp 
10      expected_verdict: false 
11      subproperty: valid-memtrack 

Fig. 1: Example task definition for program floppy.i.cil-3.c 


Validation of Results. As in previous years (2017-2019), the validation of the 
results based on verification witnesses [19, 20] was mandatory for both 
answers TRUE and FALSE. A few categories were excluded from validation if the 
validators did not sufficiently support a certain kind of program or property. Two 
new validators participated in SV-COMP 2020: NitWit [66] and MetaVal [25]. 


Verification Tasks — Explicit Task-Definition Files. The notion of verifica- 
tion tasks did not change and we refer to previous reports for more details [10, 13]. 
We developed a new format for task definitions that was used for the Java cate- 
gory already in SV-COMP 2019. Technically, we need a verification task (a pair 
of a program and a specification to verify) to feed as input to the verifier, and 
an expected result against which we check the answer that the verifier returns. 
Previously, the above-mentioned three components were specified in the file name 
of the program; now all the information is stored in an extra file that contains a 
structured definition of the verification tasks for a program. For each program, the 
repository contains the program file and a task-definition file. Consider an example 
program that is available under the name floppy.i.cil-3.c: this program 
now comes with its task-definition file floppy.i.cil-3.yml. Figure 1 shows 
this task definition. The new format was used in SV-COMP 2019 for the Java 
category [14] and in the competition on software testing, Test-Comp 2019 [15]. 

The task definition uses YAML as its underlying structured data 
format. It contains a version id of the format (line 1) and can contain 
comments (line 3). The field input_files specifies the input program 
(example: 'floppy.i.cil-3.c'), which is either one file or a list of files. The field 
properties lists all properties of the specification for this program. Each 
property has a field property_file that specifies the property file (example: 
../properties/unreach-call.prp) and a field expected_verdict that 
specifies the expected result (example: true). 
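
To illustrate how a benchmarking script might consume such a task definition, the following Python sketch expands the definition from Fig. 1 into individual verification tasks. It assumes that the YAML file has already been loaded into a dictionary by some YAML parser and that the field names are written with underscores; the function name `verification_tasks` is hypothetical.

```python
# The task definition from Fig. 1, written as the dictionary that a YAML
# parser would be expected to produce (assumption: keys use underscores).
task_definition = {
    "format_version": "1.0",
    "input_files": "floppy.i.cil-3.c",
    "properties": [
        {
            "property_file": "../properties/unreach-call.prp",
            "expected_verdict": True,
        },
        {
            "property_file": "../properties/valid-memsafety.prp",
            "expected_verdict": False,
            "subproperty": "valid-memtrack",
        },
    ],
}


def verification_tasks(definition):
    """Yield (input files, property file, expected verdict, subproperty)."""
    files = definition["input_files"]
    if isinstance(files, str):  # one file may be given instead of a list
        files = [files]
    for prop in definition["properties"]:
        yield (
            tuple(files),
            prop["property_file"],
            prop["expected_verdict"],
            prop.get("subproperty"),  # only present for some violations
        )


tasks = list(verification_tasks(task_definition))
```

Each resulting tuple pairs the program with one property and the expected result, which is the information that was previously encoded in the file name.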


Categories, Properties, Scoring Schema, and Ranking. The categories 
are listed in Tables 7 and 8 and described in detail on the competition web site.¹³ 
Figure 2 shows the category composition. For the definition of the properties 
and the property format, we refer to the 2015 competition report [11]. All 
specifications are available in the directory c/properties/ of the benchmark 


13 https://sv-comp.sosy-lab.org/2020/benchmarks.php 

Fig. 2: Category structure for SV-COMP 2020; category C-FalsificationOverall 
contains all verification tasks of C-Overall without Termination; Java-Overall 
contains all Java verification tasks 

Table 1: Properties used in SV-COMP 2020 (unchanged since 2019 [14]) 

Formula              Interpretation 

G ! call(foo())      A call to function foo is not reachable on any finite execution. 

G valid-free         All memory deallocations are valid (counterexample: invalid free). 
                     More precisely: There exists no finite execution of the program 
                     during which an invalid memory deallocation occurs. 

G valid-deref        All pointer dereferences are valid (counterexample: invalid 
                     dereference). More precisely: There exists no finite execution of 
                     the program during which an invalid pointer dereference occurs. 

G valid-memtrack     All allocated memory is tracked, i.e., pointed to or deallocated 
                     (counterexample: memory leak). More precisely: There exists 
                     no finite execution of the program during which the program lost 
                     track of some previously allocated memory. 

G valid-memcleanup   All allocated memory is deallocated before the program 
                     terminates. In addition to valid-memtrack: There exists 
                     no finite execution of the program during which the program 
                     terminates but still points to allocated memory. 
                     (Comparison to Valgrind: This property can be violated even 
                     if Valgrind reports 'still reachable'.) 

F end                All program executions are finite and end on proposition end, 
                     which marks all program exits (counterexample: infinite loop). 
                     More precisely: There exists no execution of the program on 
                     which the program never terminates. 


Table 2: Scoring schema for SV-COMP 2020 (unchanged since 2017 [13]) 

Reported result            Points   Description 

UNKNOWN                       0     Failure to compute verification result 
FALSE correct                +1     Violation of property in program was correctly found 
                                    and a validator confirmed the result based on a witness 
FALSE incorrect             -16     Violation reported but property holds (false alarm) 
TRUE correct                 +2     Program correctly reported to satisfy property 
                                    and a validator confirmed the result based on a witness 
TRUE correct unconfirmed     +1     Program correctly reported to satisfy property, 
                                    but the witness was not confirmed by a validator 
TRUE incorrect              -32     Incorrect program reported as correct (wrong proof) 


repository. Table 1 lists the properties and their syntactical representation as 
overview. Property G valid-memcleanup, and thus, the category MemCleanup, 
was used for the first time in SV-COMP 2019. The categories AWS-C-Common 
and OpenBSD were added for SV-COMP 2020. 

The scoring schema is identical for SV-COMP 2017-2020: Table 2 provides 
the overview and Fig. 3 visually illustrates the score assignment for one prop- 
erty. The scoring schema still contains the special rule for unconfirmed cor- 
rect results for expected result TRUE that was introduced in the transitioning 
phase: one point is assigned if the answer matches the expected result but 
the witness was not confirmed. 
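
The score assignment of Table 2 can be written down as a small function. The sketch below is an illustration, not the competition's scoring code: the function name and argument encoding are hypothetical, and it assumes that a correct FALSE answer whose witness is not confirmed receives 0 points, since the special one-point rule is stated only for expected result TRUE.

```python
def score(reported, expected, witness_confirmed=False):
    """Points for one verification result, following Table 2.

    reported: 'TRUE', 'FALSE', or 'UNKNOWN' (the verifier's answer)
    expected: True or False (the expected verdict of the task)
    witness_confirmed: whether a validator confirmed the witness
    """
    if reported == "UNKNOWN":
        return 0
    if reported == "FALSE":
        if expected is False:
            # Assumption: an unconfirmed correct FALSE answer scores 0.
            return 1 if witness_confirmed else 0
        return -16  # false alarm
    if reported == "TRUE":
        if expected is True:
            # Special rule: one point if the witness was not confirmed.
            return 2 if witness_confirmed else 1
        return -32  # wrong proof
    raise ValueError(f"unexpected result: {reported!r}")


# A hypothetical verifier with three answers: one confirmed proof,
# one false alarm, and one unknown result.
total = score("TRUE", True, True) + score("FALSE", True) + score("UNKNOWN", False)
```

The asymmetry between -16 and -32 reflects that a wrong proof is penalized twice as heavily as a false alarm.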

[Figure: decision flow from the verifier's answer (true, false, unknown) through the witness validator (witness confirmed, unconfirmed, or invalid) to the assigned score] 

Fig. 3: Visualization of the scoring schema for the reachability property (from [13], 
© Springer-Verlag) 


The ranking was again decided based on the sum of points (normalized for 
meta categories). In case of a tie, the ranking was decided based on success run 
time, which is the total CPU time over all verification tasks for which the verifier 
reported a correct verification result. Opt-out from categories and score 
normalization for meta categories were handled as described previously [9] (page 597). 
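
The ranking rule (sum of points first, success run time as tie-breaker) amounts to a two-key sort. The following Python sketch uses hypothetical verifier names and values and does not include the score normalization for meta categories.

```python
def rank(results):
    """Order verifiers by score (descending), then success run time (ascending).

    results: iterable of (verifier, score, success_run_time_in_seconds).
    """
    return sorted(results, key=lambda r: (-r[1], r[2]))


# Hypothetical results: B and C tie on score, C has the lower success run time.
ranking = rank([("A", 90, 10.0), ("B", 100, 500.0), ("C", 100, 300.0)])
```

Only the CPU time of correctly solved tasks enters the tie-breaker, so a verifier is not penalized in the ranking for time spent on tasks it failed to solve.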


3 Reproducibility 


All major components used in the competition are available in public version 
repositories. This allows independent replication of the SV-COMP experiments. 
An overview of the components that contribute to the reproducible setup of SV- 
COMP is provided in Fig. 4, and the details are given in Table 3. The SV-COMP 
2016 report [12] describes all components of the SV-COMP organization and how 
we ensure that all parts are publicly available for maximal replicability. 

We have published the competition artifacts at Zenodo to guarantee their 
long-term availability and immutability. These artifacts comprise the verification 
tasks, the produced competition results, and the produced verification witnesses. 
The DOIs and references are given in Table 4. The archive for the competition 
results includes the raw results in BenchExec's XML exchange format, the log 
output of the verifiers and validators, and a mapping from file names to SHA-256 
hashes. The hashes of the files are useful for validating the exact contents of a file 
and for accessing the files inside the archive that contains the verification witnesses. 
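
Checking a file against a recorded SHA-256 hash can be done with the Python standard library alone. The sketch below is a generic illustration; the function names are hypothetical, and the layout of the mapping file in the archive is not assumed here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of the file at the given path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hash):
    """Check whether the file content matches the recorded hash."""
    return sha256_of_file(path) == expected_hash.lower()
```

A call such as `matches("witness.graphml", recorded_hash)` (with a hypothetical file name) would then confirm that a local file is byte-for-byte the one referenced in the results.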

To provide a more transparent way of accessing the exact versions of the 
verifiers that were used in the competition, all verifier archives are stored in a 
public Git repository. GITLAB was used to host the repository for the verifier 
archives due to its generous repository size limit of 10 GB. The final size of 
the Git repository is 5.78 GB. 

[Figure: (a) Verification Tasks, (b) Benchmark Definitions, (c) Tool-Info Modules, and (d) Verifier Archives feed into (e) Verification Run, which yields the result TRUE, FALSE, or UNKNOWN together with (f) a Correctness Witness or a Violation Witness] 

Fig. 4: SV-COMP components and the execution flow 


Table 3: Publicly available components for replicating SV-COMP 2020 

Component               Fig. 4   Repository                                   Version 
Verification Tasks      (a)      github.com/sosy-lab/sv-benchmarks            svcomp20 
Benchmark Definitions   (b)      github.com/sosy-lab/sv-comp                  svcomp20 
Tool-Info Modules       (c)      github.com/sosy-lab/benchexec                2.5.1 
Verifier Archives       (d)      gitlab.com/sosy-lab/sv-comp/archives-2020    svcomp20 
Benchmarking            (e)      github.com/sosy-lab/benchexec                2.5.1 
Witness Format          (f)      github.com/sosy-lab/sv-witnesses             svcomp20 


4 Results and Discussion 


The results of the competition experiments represent the state of the art in fully 
automatic software-verification tools. The report shows the results, in terms of 
effectiveness (number of verification tasks that can be solved and correctness of 
the results, as accumulated in the score) and efficiency (resource consumption 
in terms of CPU time). The results are presented in the same way as in previous 
years, such that the improvements compared to last year are easy to identify. The 
results presented in this report were inspected and approved by the participating 
teams. We now discuss the highlights of the results. 


Participating Verifiers. Table 5 and the competition web site¹⁴ provide an 
overview of the participating verification systems. Table 6 lists the algorithms 
and techniques that are used in the verification tools. 


Computing Resources. The resource limits were the same as in the previous 
competitions [12]: Each verification run was limited to 8 processing units (cores), 
15 GB of memory, and 15 min of CPU time. The witness validation was limited 
to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation 
witnesses and 15 min of CPU time for correctness witnesses. The machines 
for running the experiments are part of a compute cluster that consists of 


14 https://sv-comp.sosy-lab.org/2020/systems.php 

Table 4: Artifacts published for SV-COMP 2020 

Content                  DOI                       Reference 
Verification Tasks       10.5281/zenodo.3633334    [17] 
Competition Results      10.5281/zenodo.3630205    [16] 
Verification Witnesses   10.5281/zenodo.3630188    [18] 


Table 5: Competition candidates with tool references and representing jury members 

Participant    Ref.      Jury member            Affiliation 
2LS            [26,55]   Viktor Malik           BUT, Brno, Czechia 
Brick                    Lei Bu                 Nanjing U., China 
CBMC           [51]      Michael Tautschnig     Amazon Web Services, UK 
COASTAL        [67]      Willem Visser          Stellenbosch U., South Africa 
CPA-BAM-BnB    [3,68]    Vadim Mutilin          ISP RAS, Russia 
CPA-Lockator   [4]       Pavel Andrianov        ISP RAS, Russia 
CPA-Seq        [22,36]   Martin Spiessl         LMU Munich, Germany 
Dartagnan      [40,53]   Hernán Ponce de León   Bundeswehr U. Munich, Germany 
Divine         [6,52]    Henrich Lauko          Masaryk U., Czechia 
ESBMC          [38,39]   Felipe R. Monteiro     Federal U. of Amazonas, Brazil 
GACAL          [61]      Benjamin Quiring       Northeastern U., USA 
Java-Ranger    [65]      Vaibhav Sharma         U. of Minnesota, USA 
JayHorn        [49,50]   Philipp Ruemmer        Uppsala U., Sweden 
JBMC           [33,34]   Peter Schrammel        U. of Sussex, UK 
JDart          [54,56]   Falk Howar             TU Dortmund, Germany 
Lazy-CSeq      [47,48]   Omar Inverso           Gran Sasso Science Inst., Italy 
Map2Check      [63,64]   Herbert Rocha          Federal U. of Roraima, Brazil 
PeSCo          [35,62]   Cedric Richter         Paderborn U., Germany 
Pinaka         [30]      Saurabh Joshi          IIT Hyderabad, India 
PredatorHP     [44,59]   Veronika Soková        BUT, Brno, Czechia 
SPF            [57,60]   Willem Visser          Amazon, USA 
Symbiotic      [28,29]   Marek Chalupa          Masaryk U., Czechia 
UAutomizer     [42,43]   Matthias Heizmann      U. of Freiburg, Germany 
UKojak         [27,58]   Alexander Nutz         U. of Freiburg, Germany 
UTaipan        [37,41]   Daniel Dietsch         U. of Freiburg, Germany 
VeriAbs        [1,2]     Priyanka Darke         Tata Consultancy Services, India 
VeriFuzz       [31]      Raveendra Kumar M.     Tata Consultancy Services, India 
Yogar-CBMC     [70,71]   Liangze Yin            Nat. U. of Defense Techn., China 


168 machines; each verification run was executed on an otherwise completely 
unloaded, dedicated machine, in order to achieve precise measurements. Each 
machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, 
a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system 
(x86_64-linux, Ubuntu 18.04 with Linux kernel 4.15). We used BenchExec [23] 
to measure and control computing resources (CPU time, memory, CPU energy) 
and VerifierCloud¹⁵ to distribute, install, run, and clean up verification runs, 


15 https://vcloud.sosy-lab.org 

Table 6: Algorithms and techniques that the competition candidates offer 

and to collect the results. The values for time and energy are accumulated 
over all cores of the CPU. To measure the CPU energy, we use CPU Energy 
Meter [24] (integrated in BenchExec [23]). 


One complete verification execution of the competition consisted of 
138 074 verification runs (each verifier on each verification task of the selected 
categories according to the opt-outs), consuming 491 days of CPU time and 
130 kWh of CPU energy (without validation). Witness-based result validation 
required 684 858 validation runs (each validator on each verification task for 
categories with witness validation, and for each verifier), consuming 311 days 
of CPU time. Each tool was executed several times, in order to make sure no 
installation issues occur during the execution. Including preruns, the infrastructure 
managed a total of 1 018 781 verification runs consuming 4.8 years of CPU 
time, and 10 705 227 validation runs consuming 6.9 years of CPU time. 


Quantitative Results. Table 7 presents the quantitative overview of all tools 
and all categories. The head row mentions the category, the maximal score for the 
category, and the number of verification tasks. The tools are listed in alphabetical 
order; every table row lists the scores of one verifier. We indicate the top three 
candidates by formatting their scores in bold face and in larger font size. An 
empty table cell means that the verifier opted out from the respective main 
category (perhaps participating in subcategories only, restricting the evaluation 
to a specific topic). More information (including interactive tables, quantile plots 
for every category, and also the raw data in XML format) is available on the 
competition web site¹⁶ and in the results artifact (see Table 4). 


Table 8 reports the top three verifiers for each category. The run time (column 
‘CPU Time’) and energy (column ‘CPU Energy’) refer to successfully solved 
verification tasks (column ‘Solved Tasks’). We also report the number of tasks for 
which no witness validator was able to confirm the result (column ‘Unconf. Tasks’). 
The columns ‘False Alarms’ and ‘Wrong Proofs’ report the number of verification 
tasks for which the verifier reported wrong results, i.e., reporting a counterexample 
when the property holds (incorrect FALSE) and claiming that the program fulfills 
the property although it actually contains a bug (incorrect TRUE), respectively. 


Score-Based Quantile Functions for Quality Assessment. We use score- 
based quantile functions [9,23] because these visualizations make it easier to 
understand the results of the comparative evaluation. The web site¹⁶ and the 
results archive (see Table 4) include such a plot for each category. As an example, 
we show the plot for category C-Overall (all verification tasks) in Fig. 5. A total 
of 11 verifiers participated in category C-Overall, for which the quantile plot 
shows the overall performance over all categories (scores for meta categories 
are normalized [9]). A more detailed discussion of score-based quantile plots, 
including examples of what insights one can obtain from the plots, is provided 
in previous competition reports [9,12]. 


16 https://sv-comp.sosy-lab.org/2020/results 

Table 7: Quantitative overview over all results; empty cells represent opt-outs 

Participant  ReachSafety  MemSafety  ConcurrencySafety  NoOverflows  Termination  SoftwareSystems  FalsificationOverall  Overall  JavaOverall 
2LS 2491 298 0 340 1264 13 914 4924 
Brick 
CBMC 2864 -162 554 268 499 30 1256 3365 
CPA-BAM-BnB 602 
CPA-Seq 4396 355 996 483 1720 746 2772 9219 
CPA-Lockator -387 
Dartagnan 173 
Divine -76 71 550 0 0 -12 585 1151 
ESBMC 3481 334 325 264 777 500 1639 5567 
GACAL 
Lazy-CSeq 1279 
Map2Check -68 -89 
PeSCo 4376 1590 8023 
Pinaka 2585 243 590 
PredatorHP 611 
Symbiotic 2753 516 0 294 1022 954 1828 5985 
UAutomizer 26906 354 296 466 2042 591 893 8178 
UKojak 1702 231 0 387 0 501 1148 3710 
UTaipan 2185 316 289 461 0 482 805 5057 
VeriAbs 5543 0 0 0 0 244 273 2656 
VeriFuzz 1206 146 
Yogar-CBMC 1275 
COASTAL 472 
Java-Ranger 549 
JayHorn 278 
JBMC 527 
JDart 524 
SPF 410 

Table 8: Overview of the top-three verifiers for each category (measurement values for 
CPU time and energy rounded to two significant digits) 

Rank  Verifier  Score  CPU Time (in h)  CPU Energy (in kWh)  Solved Tasks  Unconf. Tasks  False Alarms  Wrong Proofs 

ReachSafety 
1 VeriAbs 5543 150 1.6 3412 171 
2 CPA-Seq 4396 72 Ne 2700 54 8 
3 PeSCo 4376 39 38 2518 36 4 

MemSafety 
1 PredatorHP 611 .78 .010 392 15 
2 Symbiotic 516 .51 .010 358 6 
3 CPA-Seq 355 .76 .010 264 1 

ConcurrencySafety 
1 Lazy-CSeq 1279 6.7 .090 1023 44 
2 Yogar-CBMC 1275 .39 .000 1024 33 
3 CPA-Seq 996 12 11 830 102 

NoOverflows 
1 CPA-Seq 483 93 .010 321 8 
2 UAutomizer 466 1.4 .010 326 0 
3 UTaipan 461 1.5 .010 323 0 

Termination 
1 UAutomizer 2942 15 .16 1606 T 
2 CPA-Seq 1720 16 T 1247 7 
3 2LS 1264 3.2 .030 955 361 3 

SoftwareSystems 
1 Symbiotic 954 25 .000 676 36 3 1 
2 CPA-Seq 746 21 .24 1381 363 1 
3 CPA-BAM-BnB 602 8.0 .070 1411 582 3 4 

FalsificationOverall 
1 CPA-Seq 2772 45 .45 2240 139 9 
2 Symbiotic 1828 27 35 1461 10 3 
3 ESBMC 1639 14 .18 1819 385 16 

Overall 
1 CPA-Seq 9219 120 1.3 6743 535 9 
2 UAutomizer 8178 83 .84 5523 693 71 2 
3 PeSCo 8023 120 1.2 6402 242 32 

JavaOverall 
1 Java-Ranger 549 1.3 .010 376 
2 JBMC 527 18 .000 376 
3 JDart 524 .26 .000 374 

Fig. 5: Quantile functions for category C-Overall. Each quantile function illustrates 
the quantile (x-coordinate) of the scores obtained by correct verification runs 
below a certain run time (y-coordinate). More details were given previously [9]. 
A logarithmic scale is used for the time range from 1s to 1000s, and a linear 
scale is used for the time range between 0s and 1s. 


Alternative Rankings. The community suggested reporting a couple of alternative 
rankings that honor different aspects of the verification process as a 
complement to the official SV-COMP ranking. Table 9 is similar to Table 8, but 
contains the alternative ranking categories Correct and Green Verifiers. Column 
‘Quality’ gives the score in score points, column ‘CPU Time’ the CPU usage of 
successful runs in hours, column ‘CPU Energy’ the CPU usage of successful runs 
in kWh, column ‘Solved Tasks’ the number of correct results, column ‘Wrong 
Results’ the sum of false alarms and wrong proofs in number of errors, and 
column ‘Rank Measure’ gives the measure to determine the alternative rank. 


Correct Verifiers — Low Failure Rate. The right-most columns of Table 8 report 
that the verifiers achieve a high degree of correctness (all top three verifiers in the C 
track have less than 2 % wrong results). The winners of category JavaOverall
produced not a single wrong answer. The first category in Table 9 uses a failure rate as
rank measure: $\frac{\text{number of incorrect results}}{\text{total score}}$, the number
of errors per score point (E/sp). We use E as unit for number of incorrect results and
sp as unit for total score. It is remarkable to see that the worst result was 0.38 E/sp
in SV-COMP 2019 and is now improved to 0.032 E/sp, which is an order of magnitude better.
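
As a worked illustration of the failure-rate measure, the following minimal sketch
recomputes the rank measure of the three Correct Verifiers from the ‘Wrong Results’
and ‘Quality’ columns of Table 9. The helper name failure_rate is hypothetical, and
rounding to four decimal places is an assumption chosen to match how the table prints
the values.

```python
# Failure rate = number of incorrect results (E) / total score (sp).
# Input values are copied from Table 9; the helper name is illustrative only.


def failure_rate(wrong_results: int, score: int) -> float:
    """Return the number of errors per score point (E/sp)."""
    return wrong_results / score


# verifier -> (wrong results in E, quality in sp), as listed in Table 9
correct_verifiers = {
    "CPA-SEQ": (9, 9219),
    "UKOJAK": (4, 3710),
    "2LS": (8, 4924),
}

# Assumption: four decimal places, matching the printed rank measures.
rates = {
    name: round(failure_rate(errors, score), 4)
    for name, (errors, score) in correct_verifiers.items()
}

# Rank by ascending failure rate (lower is better).
ranking = sorted(rates, key=rates.get)

if __name__ == "__main__":
    for rank, name in enumerate(ranking, start=1):
        print(rank, name, f"{rates[name]:.4f} E/sp")
```

Under these assumptions the recomputed values agree with the ‘Rank Measure’ column
for all three verifiers.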


Green Verifiers — Low Energy Consumption. Since a large part of the cost 
of verification is given by the energy consumption, it might be important to 
also consider the energy efficiency. The second category in Table 9 uses the 
energy consumption per score point as rank measure:
$\frac{\text{total CPU energy}}{\text{total score}}$, with the
unit J/sp. It is interesting to see that the worst result from SV-COMP 2019
was 4200 J/sp, and now it is improved to 2200 J/sp.
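
The energy measure can be recomputed in the same spirit. The sketch below converts
the ‘CPU Energy’ values of Table 9 from kWh to joule and divides by the score. The
helper names are hypothetical, and because the energy values in the table are
themselves rounded to two significant digits, the recomputation is only approximate.

```python
# Energy per score point = total CPU energy / total score, in J/sp.
# Energy values in Table 9 are given in kWh; 1 kWh = 3.6e6 J.

JOULE_PER_KWH = 3.6e6


def energy_per_score_point(energy_kwh: float, score: int) -> float:
    """Return the CPU energy spent per score point, in J/sp."""
    return energy_kwh * JOULE_PER_KWH / score


def two_significant_digits(value: float) -> float:
    """Round to two significant digits, as done for the table entries."""
    return float(f"{value:.2g}")


# verifier -> (CPU energy in kWh, quality in sp), as listed in Table 9
green_verifiers = {
    "CBMC": (0.16, 3365),
    "2LS": (0.24, 4924),
    "ESBMC": (0.41, 5567),
}

measures = {
    name: two_significant_digits(energy_per_score_point(kwh, score))
    for name, (kwh, score) in green_verifiers.items()
}

if __name__ == "__main__":
    for name, value in measures.items():
        print(name, f"{value:.0f} J/sp")
```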
Table 9: Alternative rankings; quality is given in score points (sp), CPU time in
hours (h), energy in kilowatt-hours (kWh), wrong results in errors (E), rank measures in
errors per score point (E/sp), joule per score point (J/sp), and score points (sp)


Rank  Verifier  Quality  CPU Time  CPU Energy  Solved Tasks  Wrong Results  Rank Measure
                (sp)     (h)       (kWh)                     (E)

Correct Verifiers (E/sp)
1     CPA-SEQ   9219     120       1.3         6743          9              .0010
2     UKOJAK    3710     48        0.49        2405          4              .0011
3     2LS       4924     27        0.24        3044          8              .0016
worst                                                                       .032

Green Verifiers (J/sp)
1     CBMC      3365     15        0.16        3217          67             170
2     2LS       4924     27        0.24        3044          8              180
3     ESBMC     5567     35        0.41        5520          51             270
worst                                                                       2200


Table 10: Confirmation rate of verification witnesses in SV-COMP 2020 


            Result TRUE                       Result FALSE
Verifier    Total  Confirmed     Unconf.      Total  Confirmed     Unconf.

2LS         2060   2049   99%    11           1449   995    69%    454
CBMC        1949   1821   93%    128          2095   1396   67%    699
CPA-SEQ     4347   3958   91%    389          2931   2785   95%    146
DIVINE      811    793    98%    18           1099   672    61%    427
ESBMC       3779   3701   98%    78           2204   1819   83%    385
PESCO       3777   3704   98%    73           2867   2698   94%    169
SYMBIOTIC   2196   2146   98%    50           1996   1879   94%    117
UAUTOMIZER  4135   4029   97%    106          2081   1494   72%    587
UKOJAK      1811   1801   99%    10           606    604    100%   2
UTAIPAN     2496   2438   98%    58           1308   730    56%    578
VERIABS     3908   3387   87%    521          1536   1332   87%    204


Verifiable Witnesses. All SV-COMP verifiers are required to justify the result 
(TRUE or FALSE) by producing a verification witness (except for those categories 
for which no witness validator is available). We used six independently developed 
witness-based result validators [19, 20, 21, 25, 66]. 

Fig. 6: Number of participating teams for each year


The majority of witnesses that the verifiers produced can be confirmed
by the results-validation process. Interestingly, the confirmation rate for the
TRUE results is significantly higher than for the FALSE results. Table 10 shows
the confirmed versus unconfirmed results: the first column lists the verifiers
of category C-Overall, the three columns for result TRUE report the total,
confirmed, and unconfirmed number of verification tasks for which the verifier
answered with TRUE, respectively, and the three columns for result FALSE
report the total, confirmed, and unconfirmed number of verification tasks for
which the verifier answered with FALSE, respectively. More information (for all
verifiers) is given in the detailed tables on the competition web site¹⁶ and in
the results artifact; all verification witnesses are also contained in the witnesses
artifact (see Table 4). Result validation is an important topic also in other
competitions (e.g., in the SAT competition [5, 69]).
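
The relation between the columns of Table 10 can be made explicit with a small
sketch: the unconfirmed count is the total minus the confirmed count, and the
percentage is the confirmed share of the total. The rows below are copied from
Table 10; the function names are hypothetical, and rounding to whole percent is an
assumption matching the table's presentation.

```python
# Recompute the derived columns of Table 10 for a few verifiers.
# Each entry maps a result kind to (total, confirmed), copied from the table.

table10 = {
    "CBMC": {"TRUE": (1949, 1821), "FALSE": (2095, 1396)},
    "CPA-SEQ": {"TRUE": (4347, 3958), "FALSE": (2931, 2785)},
    "UAUTOMIZER": {"TRUE": (4135, 4029), "FALSE": (2081, 1494)},
}


def confirmation_rate(total: int, confirmed: int) -> int:
    """Share of witnesses confirmed by a validator, in whole percent."""
    return round(100 * confirmed / total)


def unconfirmed(total: int, confirmed: int) -> int:
    """Number of results whose witness was not confirmed."""
    return total - confirmed


summary = {
    verifier: {
        kind: (confirmation_rate(*counts), unconfirmed(*counts))
        for kind, counts in results.items()
    }
    for verifier, results in table10.items()
}

if __name__ == "__main__":
    for verifier, results in summary.items():
        print(verifier, results)
```

For these three rows, the TRUE confirmation rate exceeds the FALSE rate for CBMC and
UAUTOMIZER, while CPA-SEQ shows the opposite pattern.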


5 Conclusion 


SV-COMP 2020, the 9th edition of the Competition on Software Verification,
attracted 28 participating teams from 11 countries (see Fig. 6 for the participation 
numbers). SV-COMP continues to offer a broad overview of the state of the art 
in automatic software verification. The competition does not only execute the 
verifiers and collect results, but also validates the verification results, using six 
independently developed results validators. The number of verification tasks was 
increased to 11052 in C and to 416 in Java. As before, the large jury and the 
organizer made sure that the competition follows the high quality standards of 
the TACAS conference, in particular with respect to the important principles 
of fairness, community support, and transparency. 


Data Availability Statement. The verification tasks and results of the com- 
petition are published at Zenodo, as described in Table 4. All components 
and data that are necessary for reproducing the competition are available in 
public version repositories, as specified in Fig. 4 and Table 3. Furthermore, 
the results are presented online on the competition web site for easy access: 
https://sv-comp.sosy-lab.org/2020/results/.
Abstract. 2LS is a framework for analysis of sequential C programs based on the
CPROVER infrastructure and template-based synthesis techniques for checking
both safety and termination. The paper presents the main improvements made to
2LS since 2018, which mainly concern the way 2LS handles dynamically allocated
objects and structures as well as combinations of abstract domains.


1 Overview 


2LS is a static analysis and verification tool for sequential C programs. At its core, it 
uses the KIKI algorithm (k-invariants and k-induction) [1], which integrates bounded 
model checking, k-induction, and abstract interpretation into a single, scalable frame- 
work. KIKI relies on incremental SAT solving in order to find proofs and refutations of 
assertions, as well as to perform termination analysis [2]. 

The 2019 and 2020 competition versions of 2LS feature product and power abstract
domain combinations supporting invariant inference for programs manipulating shape
and content of dynamic data structures [4]. Moreover, the 2020 version came with
further enhancements for handling advanced features of memory allocation and took a
step towards supporting generic abstract domain combinations.


Architecture. The architecture of 2LS has been described in previous competition 
contributions [7,5]. In brief, 2LS is built upon the CPROVER infrastructure [3] and thus 
uses GOTO programs as the internal program representation. The analysed program is 
translated into an acyclic, over-approximate static single assignment (SSA) form, in
which loops are cut at the edges returning to the loop head. Subsequently, 2LS refines 
this over-approximation by computing inductive invariants in various abstract domains 
represented by parametrised logical formulae, so-called templates [1]. The competition 
version uses the zones domain for numerical variables combined with our shape domain 
for pointer-typed variables. The SSA form is bit-blasted into a propositional formula 
and given to a SAT solver. The KIKI algorithm then incrementally amends the formula to 
perform loop unwindings and invariant inference based on template-based synthesis [1]. 
2 New Features


The major improvements of 2LS since 2018 are mostly related to analysis of heap- 
manipulating programs. We build on the shape domain presented in 2018 [5] and in- 
troduce abstract domain combinations that allow us to analyse both shape and content 
of dynamic data structures. Furthermore, we introduce a special handling for the case 
when an address of a freed heap object is re-used for the next allocation. 

Apart from improved verification of heap-manipulating programs, we also introduce
a generic skeleton of an abstract domain join algorithm, which is a step towards
supporting generic abstract domain combinations.


2.1 Combinations of Abstract Domains


The capability of 2LS to jointly analyse shape and content of dynamic data structures 
takes advantage of the template-based synthesis engine of 2LS. Invariants are computed 
in various abstract domains where each domain has the form of a template while relying 
on the analysis engine to handle the domain combinators. 


Memory Model. In our memory model, we represent dynamically allocated objects by
so-called abstract dynamic objects. Each such object is an abstraction of a number of 
concrete dynamic objects allocated by the same malloc call [4]. 


Shape Domain. For analysing the shape of the heap, we use an improved version of
the shape domain that we introduced in 2018 [5]. The domain over-approximates the 
points-to relation between pointers and symbolic addresses of memory objects in the 
analysed program: for each pointer-typed variable and each pointer-typed field of an 
abstract dynamic object p, we compute the set of all addresses that p may point to [4]. 


Template Polyhedra Domain. For analysing numerical values, we use the template
polyhedra abstract domains, particularly the interval and the zones domains [1]. 


Shape and Polyhedra Domain Combination. Since both domains have the form of
a template formula, we simply use them side-by-side in a product domain combination— 
the resulting formula is a conjunction of the two template formulae [4]. 

This combination allows 2LS to infer, e.g., invariants describing an unbounded
singly-linked list whose nodes contain values between 1 and 10. We show an example
of such a list in Figure 1. Here, all list nodes are abstracted by a single abstract
dynamic object $ao_i$ (i.e. we assume that they are all allocated at the same program
location). The invariant inferred by 2LS for such a list might look as follows:

$$(ao_i.next = \&ao_i \lor ao_i.next = \mathit{NULL}) \land ao_i.val \in [1, 10].$$

The first disjunction describes the shape of the list: the next field of each node points
to some node of the list or to NULL¹. The second conjunct is then an invariant
in the interval domain over all values stored in the list: it expresses the fact that the
value of each node lies in the interval between 1 and 10.


Figure 1. Unbounded singly-linked list abstracted by an abstract dynamic object $ao_i$.
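
To make the meaning of this invariant concrete, the following minimal sketch checks
it on concrete list nodes. It is not part of 2LS: the Node class and the function
satisfies_invariant are hypothetical, and the set of concrete objects represented by
the abstract dynamic object is simply passed in as a Python list.

```python
# Check the shape-and-content invariant on the concrete objects that one
# abstract dynamic object stands for (all names here are illustrative).
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality, like distinct heap objects
class Node:
    val: int
    next: Optional["Node"] = None


def satisfies_invariant(represented: List[Node], lo: int = 1, hi: int = 10) -> bool:
    """Shape: next is NULL or points to a represented node; data: lo <= val <= hi."""
    for node in represented:
        shape_ok = node.next is None or any(node.next is other for other in represented)
        data_ok = lo <= node.val <= hi
        if not (shape_ok and data_ok):
            return False
    return True


# A two-element list whose nodes both come from the same allocation site.
second = Node(10)
first = Node(3, second)

# A node from elsewhere; pointing to it violates the shape part.
outsider = Node(5)
escaping = Node(4, outsider)

# A node whose value lies outside the interval violates the data part.
too_large = Node(11)
```

Calling satisfies_invariant([first, second]) holds, whereas the lists [escaping] and
[too_large] violate the shape part and the interval part, respectively.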


2.2 Symbolic Paths 


To improve precision of the analysis, we let 2LS compute different invariants for dif- 
ferent symbolic paths taken by the analysed program. We require a symbolic path to 
express which loops were executed at least once. This allows us to distinguish situ- 
ations when an abstract dynamic object does not represent any really allocated object 
and hence the invariant for such abstract dynamic object is not valid [4]. 

The symbolic path domain allows us to iteratively compute a set of symbolic paths
$p_1, \ldots, p_n$ (represented by guard variables in the SSA) with associated shape and
data invariants $I_1, \ldots, I_n$. The aggregated invariant is then
$(p_1 \Rightarrow I_1) \land \cdots \land (p_n \Rightarrow I_n)$, which
corresponds to a power domain combination.
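
The aggregated invariant can be read as a conjunction of guarded invariants. The
sketch below evaluates such a conjunction on a concrete state. It only illustrates
the logical form: in 2LS the guards are SSA guard variables, whereas here guards and
invariants are modelled as Python predicates over a hypothetical state dictionary.

```python
# Evaluate (p_1 => I_1) and ... and (p_n => I_n) on a concrete state.
# The state keys and the two example paths are assumptions for illustration.
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Predicate = Callable[[State], bool]


def implies(premise: bool, conclusion: bool) -> bool:
    return (not premise) or conclusion


def aggregated_invariant(paths: List[Tuple[Predicate, Predicate]], state: State) -> bool:
    """Each pair is (guard p_i, invariant I_i); all implications must hold."""
    return all(implies(guard(state), inv(state)) for guard, inv in paths)


paths = [
    # Path 1: the allocating loop was never entered, so no node exists.
    (lambda s: s["loop_iterations"] == 0, lambda s: s["nodes"] == 0),
    # Path 2: the loop ran at least once, so the data invariant applies.
    (lambda s: s["loop_iterations"] >= 1,
     lambda s: s["nodes"] >= 1 and 1 <= s["min_val"] and s["max_val"] <= 10),
]

no_loop = {"loop_iterations": 0, "nodes": 0, "min_val": 0, "max_val": 0}
good = {"loop_iterations": 3, "nodes": 3, "min_val": 1, "max_val": 10}
bad = {"loop_iterations": 3, "nodes": 3, "min_val": 1, "max_val": 11}
```

The state no_loop satisfies the aggregated invariant even though the interval
invariant does not hold for it, because the guard of the second path is false.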


2.3 Re-using Freed Memory Object for Next Allocations 


In C, it is possible that, after a free is called, the freed memory is subsequently re-used
when a malloc is called afterwards. Due to this, it may happen that the error state in
the program in Figure 2 is reachable. This situation is, however, difficult to handle
for 2LS as its memory model creates a unique abstract dynamic object for each malloc
call. To overcome this limitation, we have introduced a special variable fr that is
non-deterministically set to the value of the freed pointer at each free call. If two
pointers x, y are compared in the analysed program using a relational operator op, we
transform the comparison x op y into

$$(x \;\mathit{op}\; y) \Leftrightarrow ((x \neq fr \lor nondet_1) \land (y \neq fr \lor nondet_2)). \quad (1)$$

Here, $nondet_1$ and $nondet_2$ are unconstrained boolean variables modelling a
non-deterministic choice. If neither x nor y has been freed, then the result of Eq. (1) is
equal to x op y, but if either of the pointers might have been freed, then the result of
Eq. (1) is non-deterministic, which makes our analysis sound for the described case.


Figure 2. Re-using a freed object (listing starting with int *a = malloc(sizeof(int)); and
ending in an error state)
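
The effect of Eq. (1) can be checked by enumerating the two non-deterministic
choices. The following sketch is a plain simulation, not 2LS code: pointers are
modelled as integers, fr is None when nothing has been freed, and the function names
are hypothetical.

```python
# Enumerate all outcomes of the transformed comparison from Eq. (1).
import operator
from itertools import product
from typing import Callable, Optional, Set


def transformed_comparison(op: Callable[[int, int], bool], x: int, y: int,
                           fr: Optional[int], nondet1: bool, nondet2: bool) -> bool:
    """(x op y) <=> ((x != fr or nondet1) and (y != fr or nondet2))."""
    plain = bool(op(x, y))
    trusted = (x != fr or nondet1) and (y != fr or nondet2)
    return plain == trusted  # equality of booleans is logical equivalence


def possible_results(op: Callable[[int, int], bool], x: int, y: int,
                     fr: Optional[int]) -> Set[bool]:
    """All values the comparison may take over both non-deterministic choices."""
    return {
        transformed_comparison(op, x, y, fr, n1, n2)
        for n1, n2 in product([False, True], repeat=2)
    }


# Neither pointer was freed: the result is exactly x op y.
unfreed_different = possible_results(operator.eq, 1, 2, fr=None)
unfreed_equal = possible_results(operator.eq, 3, 3, fr=None)

# x equals the freed pointer: both outcomes become possible.
freed = possible_results(operator.eq, 1, 2, fr=1)
```

With a freed operand both truth values are reachable, so a branch guarded by the
comparison, such as the one leading to the error state, is no longer excluded.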


¹ Here, $ao_i.f$ is an abstraction of the f fields of all concrete objects represented by
$ao_i$. Analogously, $\&ao_i$ is an abstraction of symbolic addresses of all represented
objects.


2.4 Generic Abstract Domain Templates


As mentioned in Section 1, abstract domains are represented in 2LS by so-called
templates. The main purpose of templates is to reduce the second-order problem of
finding an inductive invariant to a first-order problem of finding values of template
parameters. Apart from defining the form of the template (a parametrised logical formula),
each abstract domain also needs to specify an algorithm to perform a join of the current
values of template parameters with a model of satisfiability returned by an SMT solver.
However, most of the domains use a similar approach to this algorithm, and therefore
adding a new abstract domain to 2LS requires one to write an algorithm whose skeleton
has already been written in existing domains.

To overcome this problem, we proposed a generic algorithm suitable for all existing 
abstract domains (see [6] for details). The main idea is based on the fact that most of the 
templates are conjunctions of multiple formulae, where each has its own parameter and 
describes a part of the analysed program, e.g., properties of a single program variable. 

While this extension did not bring any additional functionality that would increase
the score of 2LS in this year’s edition of SV-COMP, it opened up possibilities for future
enhancements, in particular (1) it simplifies adding new abstract domains capable
of analysing program properties that 2LS is currently not able to handle and (2) it is
a significant step towards supporting generic abstract domain combinations that would
allow 2LS to arbitrarily combine abstract domains and therefore analyse complex
properties of programs requiring simultaneous reasoning in multiple domains.
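
The idea of a shared join skeleton can be sketched as follows. This is a loose
illustration under stated assumptions, not the algorithm of [6] and not the 2LS C++
interface: the class and method names are hypothetical, a template is modelled as one
parameter per row, and a solver model is modelled as a dictionary of variable values.

```python
# A generic per-row join skeleton with one concrete domain plugged in.
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Optional, Tuple


class TemplateDomain(ABC):
    """A template as a conjunction of rows, each with its own parameter."""

    def __init__(self, rows: Iterable[str]) -> None:
        self.params = {row: self.bottom() for row in rows}

    @abstractmethod
    def bottom(self):
        """Initial parameter value of a row (no behaviour seen yet)."""

    @abstractmethod
    def row_value(self, row: str, model: Dict[str, int]):
        """Value that the solver model assigns to the row's expression."""

    @abstractmethod
    def join_row(self, current, new):
        """Domain-specific join of one parameter with one model value."""

    def join(self, model: Dict[str, int]) -> None:
        # The shared skeleton: iterate over rows, delegate the details.
        for row in self.params:
            self.params[row] = self.join_row(self.params[row], self.row_value(row, model))


class IntervalDomain(TemplateDomain):
    """Rows are variables; a parameter is a (low, high) pair or None for bottom."""

    def bottom(self) -> Optional[Tuple[int, int]]:
        return None

    def row_value(self, row: str, model: Dict[str, int]) -> int:
        return model[row]

    def join_row(self, current: Optional[Tuple[int, int]], new: int) -> Tuple[int, int]:
        if current is None:
            return (new, new)
        return (min(current[0], new), max(current[1], new))


domain = IntervalDomain(["x", "y"])
domain.join({"x": 1, "y": 5})
domain.join({"x": 10, "y": 5})
```

In this sketch a new domain only supplies bottom, row_value, and join_row, while the
iteration over template rows is written once.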


3 Strengths and Weaknesses 


One of the main strengths of 2LS is verification of programs requiring joint reasoning
about shape and content of dynamic data structures. In 2019, we contributed 10
benchmarks to the ReachSafety category requiring such reasoning. The domain
combination described in Section 2.1 allows 2LS to successfully verify 9 out of 10 of
these benchmarks (the last one timed out), making it the only tool capable of this apart
from the category winner. Also, 2LS is notably strong in analysing termination, which
is supported by the third place in the Termination category.

Still, there remain many challenges and limitations. The main problem is that 2LS
still lacks reasoning about array contents, and that it does not yet support recursion.


4 Tool Setup 


The competition submission is based on 2LS version 0.8.² The archive contains the
binaries needed to run 2LS (2ls-binary, goto-cc), and so no further installation is needed.
There is also a wrapper script 2ls which is used by BenchExec to run the tool over
the verification benchmarks. See the wrapper script also for the relevant command line
options given to 2LS. Further information about the contents of the archive can
be found in the README file. The tool info module for 2LS is called two_ls.py and the
benchmark definition file 2ls.xml. As a back end, the competition submission of 2LS
uses Glucose 4.0. 2LS competes in all categories except Concurrency and Java.


5 Software Project 


2LS is implemented in C++ and it is maintained by Peter Schrammel with contributions
by the community.³ It is publicly available at http://www.github.com/diffblue/2ls
under a BSD-style license.


² Executable available at https://doi.org/10.5281/zenodo.3678347.
³ https://github.com/diffblue/2ls/graphs/contributors
Abstract. COASTAL is a program analysis tool for Java programs. It
combines concolic execution and fuzz testing in a framework with built-in 
concurrency, allowing the two approaches to cooperate naturally. 


1 Verification Approach and Software Architecture 


COASTAL analyses Java bytecode with an approach that combines concolic 
execution and fuzz testing in a unified framework. It uses the ASM bytecode 
manipulation library [2] to add code to compiled class files to monitor and inter- 
act with the system under test (SUT). The concurrent COASTAL components 
that carry out the analysis are shown in Figure 1: 


— Multiple divers (for concolic analysis) execute the SUT with different con- 
crete input values. A diver run is triggered when a vector of concrete input 
values is added to the diver input queue d_in. As a diver executes, the in- 
strumented code mirrors the state of the program with symbolic values. At 
the end of the run, the symbolic path condition that corresponds to the 
execution is enqueued in the diver output queue d_out.

— Multiple surfers (for fuzzing analysis) also execute the SUT with concrete 
input values. A surfer run is triggered when a vector of concrete input values 
is added to the surfer input queue s_in. As a surfer executes, lighter
instrumentation records the “shape” of the execution path, and at the end of the
run, this information is enqueued in the surfer output queue s_out.

— One or more strategies remove and process the information that appears on 
the diver and surfer output queues. For example, a strategy may remove a 
path condition, negate one or more constraints, invoke an SMT solver to find 
input values that will explore the modified path, and enqueue them on d_in
or s_in or both. Instrumentation injects the input values into the SUT. 

— To share information between components, discovered execution paths are 
stored in a shared execution tree known as the pathtree. The pathtree keeps 
track of which sub-trees have been fully explored. The pathtree data struc- 
ture allows for efficient concurrent updates. 
— Divers, surfers, strategies, and the pathtree signal their actions via a publish-
subscribe system. When events are published to the message broker, one or
more observers are notified. The observers may, in turn, emit messages that
direct the operation of COASTAL.


Fig. 1. COASTAL architecture


1.1 Strategies 


As an example, a depth-first strategy is a simple configuration of COASTAL 
where the strategy employs only a single diver. The diver produces one path 
condition that is processed by the strategy by negating the last (deepest) con- 
straint, and sending it to an SMT solver, which produces new input values (if 
any) that will explore the modified path. If a modified path condition is unsatisfi- 
able, the last constraint is discarded and the process repeats. All path conditions 
are added to the pathtree as they are discovered. At the end of the analysis, the 
pathtree contains a summary of the execution tree of the SUT. 

Other strategies include breadth-first and random exploration. Like depth- 
first exploration, these strategies use only one diver and explore one path con- 
dition at a time. On the other hand, a generational strategy negates all the 
constraints of a path condition, one by one, and produces many potential input 
values. In this case, multiple divers can be used concurrently. Users can also 
deploy multiple strategies at the same time. 
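
The depth-first strategy described above can be sketched in a few lines. The code
below is a toy model and does not use the COASTAL API: the system under test is a
small Python function with one integer input, the path condition is a list of
recorded branch predicates, and a brute-force search over a small input domain stands
in for the SMT solver. All names are hypothetical.

```python
# Toy depth-first concolic exploration: negate the deepest constraint,
# solve for a new input, and discard constraints once both sides are done.
from typing import Callable, List, Optional, Sequence, Tuple

Predicate = Callable[[int], bool]
Trace = List[Tuple[Predicate, bool]]


def branch(trace: Trace, predicate: Predicate, value: int) -> bool:
    """Record the branch predicate and its outcome, as instrumentation would."""
    taken = predicate(value)
    trace.append((predicate, taken))
    return taken


def sut(x: int, trace: Trace) -> str:
    """Example system under test with three execution paths."""
    if branch(trace, lambda v: v > 5, x):
        if branch(trace, lambda v: v % 2 == 0, x):
            return "A"
        return "B"
    return "C"


def solve(constraints: Trace, domain: Sequence[int]) -> Optional[int]:
    """Stand-in for an SMT solver: first input satisfying all constraints."""
    for candidate in domain:
        if all(pred(candidate) == taken for pred, taken in constraints):
            return candidate
    return None


def depth_first(program: Callable[[int, Trace], str], first_input: int,
                domain: Sequence[int]) -> List[Tuple[bool, ...]]:
    paths = []   # discovered path shapes (a flat stand-in for the pathtree)
    stack = []   # entries: [predicate, taken, already_negated]
    value: Optional[int] = first_input
    while value is not None:
        trace: Trace = []
        program(value, trace)
        paths.append(tuple(taken for _, taken in trace))
        # Keep bookkeeping for the steered prefix, add fresh entries for the rest.
        stack = stack[:len(trace)] + [[p, t, False] for p, t in trace[len(stack):]]
        value = None
        while stack and value is None:
            if stack[-1][2]:
                stack.pop()  # both outcomes handled: discard the constraint
                continue
            stack[-1][1] = not stack[-1][1]  # negate the deepest constraint
            stack[-1][2] = True
            value = solve([(p, t) for p, t, _ in stack], domain)
            # If unsatisfiable, the next iteration discards this constraint.
    return paths


explored = depth_first(sut, 0, range(-20, 21))
```

Starting from input 0, the sketch visits the path with the first branch not taken,
then both outcomes of the nested branch, and stops once no constraint is left to
negate.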


Fuzzing strategies. The user can employ surfers to perform straightforward 
fuzz testing (in the style of AFL [1,5,6]). Surfers use very little instrumentation. 
Unlike the divers, which instrument every bytecode instruction, surfers record only
the outcomes of branching points. The “path condition” produced by
a surfer is therefore a series of (mostly binary) choices that can be added to
the pathtree; it lacks any details about the reason for the choice (for example,
instead of “a > 5” it may simply record “false”), but the shape of the path is
preserved. Multiple divers and multiple surfers are deployed concurrently and
operate interactively.


Hybrid strategies. More advanced strategies can combine concolic and fuzzing 
analysis to exploit the strengths of both approaches: surfers (fuzzing) can rapidly 
explore new territory of the execution space, while divers (concolic) can investi- 
gate hard-to-reach corners. Such hybrid strategies enqueue (semi-)random inputs 
on s_in and the results contribute to a “skeletal” pathtree. Since surfers produce 
results at a high rate, the easy-to-explore parts of the execution space are more 
quickly saturated. Unexplored regions of the pathtree are passed to the divers, 
and their results, in turn, open up new regions for the surfers to explore. 


1.2 Observers and Models 


COASTAL was designed with extensibility in mind. One example is the use 
of observers. Any component is allowed to subscribe to the various message 
streams, and can interact with the system by publishing messages of their own, 
or by making direct calls to the public COASTAL API. Examples of observer 
tasks include: 


— monitor assertions and halt COASTAL when they are violated,
— record instruction, line, and condition coverage,
— enforce assumptions and prune undesired execution paths,
— gather information and display progress in a GUI.


In theory, strategies themselves could be implemented as observers. But since 
they are central to the operation of COASTAL, they are given special treatment. 

Users can replace system- or user-level libraries by more appropriate mod- 
els, either as a whole or on a method-by-method basis. For example, a complex 
library implementation of String.substring() can be replaced with a sim- 
pler, more efficient model that produces the same result and the same symbolic 
constraints. 


2 Strengths and weaknesses 


The tool’s strength lies in the combination of concolic and fuzzing analysis, but
COASTAL is still under development and a “deep” bug (now fixed) prevented
the use of fuzzing. Participation in SV-COMP [3] was invaluable in this regard:
several bugs and missing functionality were revealed and corrected.


Results. COASTAL does not output any incorrect answers, but produces an 
unknown result in 19% of cases. This is shown in column “Count” below. 
Answer    Count            Immediate
true      135   32.45%     121   89.63%
false     202   48.57%     134   66.34%
unknown    79   18.99%      27   34.18%
All       416  100%        282   67.79%


For many cases, the answer is produced instantaneously (column “Immediate”).
In the case of unknown answers, this indicates that COASTAL aborted its analysis
because of an as-yet unsupported feature such as symbolic array sizes. For the
79 − 27 = 52 non-immediate unknown answers, COASTAL timed out because
of large search spaces.

The longest-running true answer required 2 diver runs, each taking 20.48 sec
(printtokens_eqchk.yml), whereas the longest-running false answer required
141 diver runs, each taking 0.54 sec (spec1-5_product1.yml). This highlights a
fundamental weakness of the tool: a long-running SUT takes longer to analyse.
A generational strategy where multiple divers execute concurrently can ameliorate
this problem, but on average does not find errors as quickly as the breadth-first
strategy. This points to the need to refine the generational strategy to
prioritize shallow unexplored paths.


3 Tool setup 


Download. http://doi.org/10.5281/zenodo.3679243 [7] 


Configuration. COASTAL is configured to use a breadth-first search strategy
and a single diver. Z3 [4] is set as the constraint solver. (It is the only external
tool required to run COASTAL and a Linux executable version is included in the
download above.) Path conditions are limited to 800 conjuncts, and a time limit
of 240 seconds is set. Symbolic strings are limited to 25 characters. Custom models
are used for some Java classes: Character, String, StringBuilder, Pattern,
Matcher, Scanner. COASTAL competed in the JavaOverall category.


Installation. The download above is self-contained. The COASTAL project 
at https://github.com/DeepseaPlatform/coastal/ includes shell scripts to pack- 
age and run COASTAL for SV-COMP in the extra/svcomp subdirectory. The 
scripts need an external copy of the Z3 solver to be available.


4 Software Project 


COASTAL is developed by the authors at Stellenbosch University, South Africa. 
It is available at https://github.com/DeepseaPlatform/coastal/ and is distributed 
under the GNU Lesser General Public License version 3. 
Abstract. DARTAGNAN is a bounded model checker for concurrent pro-
grams under weak memory models. What makes it different from other 
tools is that the memory model is not hard-coded inside DARTAGNAN 
but taken as part of the input. For SV-COMP’20, we take as input 
sequential consistency (i.e. the standard interleaving memory model) ex- 
tended by support for atomic blocks. Our point is to demonstrate that 
a universal tool can be competitive and perform well in SV-COMP. 
Being a bounded model checker, DARTAGNAN's focus is on disproving 
safety properties by finding counterexample executions. For programs 
with bounded loops, DARTAGNAN performs an iterative unwinding that 
results in a complete analysis. The SV-COMP’20 version of DARTAG- 
NAN works on BOOGIE code. The C programs of the competition are 
translated internally to BOOGIE using SMACK. 


1 Overview and Software Architecture 


DARTAGNAN is a bounded model checker for concurrent programs under weak
memory models. It expects as input a program P annotated with a reachability 
condition S, a memory model M, and an unrolling bound k. It recursively un- 
winds all loops in P up to the bound k. The unwound program is converted into 
an SMT formula that symbolically represents all candidate executions. The memory
model will filter out some candidates using a second formula, which we explain
below. Events of a candidate execution model (instances of) program instruc-
tions, like memory accesses, local computations, and conditional/unconditional 
jumps. Edges model relations between events, including program order (the or- 
der within a thread), data-dependencies (an assigned variable is used within an 
expression), reads-from (matching each read with the write from which it takes 
its value), and coherence (the order in which writes commit to the memory). 

* Jury member.


© The Author(s) 2020
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 378-382, 2020.
https://doi.org/10.1007/978-3-030-45237-7_24


DARTAGNAN: Bounded Model Checking 379


SC + atomicity

com = co ∪ fr ∪ rf              come = com ∩ ext
acyclic po ∪ com                empty rmw ∩ (come ; (po ∪ com)* ; come)


Fig. 1. CAT model used for SV-COMP’20.


A memory model can be understood as a predicate over candidate executions
that declares some of them valid. We describe memory models in the CAT
language [2]. A memory model is defined as a set of relations (those mentioned
above and others derived as unions, transitive/reflexive closures, compositions,
etc.) and constraints over them (emptiness, acyclicity and irreflexivity). Given
a memory model, we construct a formula that evaluates to true precisely under
the candidate executions that are valid according to the memory model.
Figure 1 shows the memory model used for SV-COMP’20. To support atomic
blocks, DARTAGNAN adds a specific edge (rmw) for every pair of events between
__VERIFIER_atomic_begin() and its matching __VERIFIER_atomic_end()
or in a __VERIFIER_atomic_ function. We encode atomicity for sequential
consistency (SC) as the empty intersection of rmw and paths starting and ending with
an external communication (i.e. between different threads). This means once an
atomic block starts, external communications with the block are forbidden until
all events in the block have been executed.
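
The acyclicity constraint of the SC model can be illustrated on an explicit candidate
execution. The sketch below is not how DARTAGNAN works internally (the tool encodes
all candidates symbolically in an SMT formula); it checks a single, hand-written
candidate of the classic store-buffering pattern. Event names are hypothetical, initial
writes are modelled as ordinary events, and the atomicity constraint is omitted.

```python
# Check "acyclic po ∪ com" with com = co ∪ fr ∪ rf on one candidate execution.
from typing import Dict, Set, Tuple

Relation = Set[Tuple[str, str]]


def from_reads(rf: Relation, co: Relation) -> Relation:
    """fr relates a read to every write that is co-after the write it reads from."""
    return {(read, later) for write, read in rf for earlier, later in co if earlier == write}


def acyclic(edges: Relation) -> bool:
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph: Dict[str, Set[str]] = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    white, grey, black = 0, 1, 2
    colour = dict.fromkeys(graph, white)

    def visit(node: str) -> bool:
        colour[node] = grey
        for successor in graph[node]:
            if colour[successor] == grey:
                return False  # back edge: a cycle exists
            if colour[successor] == white and not visit(successor):
                return False
        colour[node] = black
        return True

    return all(colour[node] != white or visit(node) for node in graph)


def sc_consistent(po: Relation, rf: Relation, co: Relation) -> bool:
    com = co | from_reads(rf, co) | rf
    return acyclic(po | com)


# Store buffering: thread 0 runs Wx; Ry and thread 1 runs Wy; Rx.
po = {("Wx", "Ry"), ("Wy", "Rx")}
co = {("InitX", "Wx"), ("InitY", "Wy")}

# Candidate 1: both reads observe the initial values.
both_read_initial = sc_consistent(po, {("InitY", "Ry"), ("InitX", "Rx")}, co)

# Candidate 2: the read of x observes the write of thread 0.
one_reads_new = sc_consistent(po, {("InitY", "Ry"), ("Wx", "Rx")}, co)
```

The first candidate contains the cycle Wx, Ry, Wy, Rx, Wx and is therefore rejected,
while the second candidate is accepted.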


DARTAGNAN comes with a rich assertion language inspired by HERD [1].
Assertions define inequalities over the values of local and global variables. They
can be used freely throughout the code, rather than being limited to the end
of the execution. Semantically, our assertions do not stop the execution but
record the failure and continue. To achieve this, each instruction assert(exp)
is transformed to a local computation f ← exp where the fresh variable f ∈ F
stores the value of exp at the corresponding point of the execution. We refer to
the formula $\bigvee_{f \in F} \neg f$ as the reachability condition.


The formula for candidate executions of the program, the formula for valid- 
ity under the given memory model, and the reachability condition together (in 
conjunction) yield the SMT encoding of the reachability problem at hand. Any 
solution to the conjunction corresponds to an execution that is valid according 
to the memory model and violates at least one assertion. Details on the encoding 
can be found in [8,9]. 


DARTAGNAN implements a may-alias analysis to improve pointer precision 
and a novel relation analysis. The latter technique reduces the SMT encoding to 
those parts of the relations that might affect the consistency with the memory 
model, resulting in a considerably smaller formula. Relation analysis improves 
the performance by up to two orders of magnitude [4,5]. We remark that related
approaches represent each candidate execution explicitly [1,6]. Thanks to the 
symbolic representation of executions and static analysis techniques such as re- 
lation analysis, DARTAGNAN is often more efficient [4,5]. 


Figure 2 shows the overall architecture of DARTAGNAN. It reads programs
written in the litmus format of HERD [1] or the intermediate verification language
BOOGIE [7]. For the competition, C programs are compiled to LLVM and then
translated internally to BOOGIE using the SMACK tool [11]. The SMT solver
is Z3 [3]. When a violation is found, DARTAGNAN returns a witness execution.


380 H. Ponce-de-León et al.


Fig. 2. DARTAGNAN’s architecture.


2 Strengths and Weaknesses 


The main strength of DARTAGNAN is its fully configurable memory model.
Unfortunately, in SV-COMP’20 there is no category for verification tasks under
weak memory models. On the SV-COMP’20 benchmarks, DARTAGNAN reports
only one incorrect result, being beaten in that aspect only by CPACHECKER,
DIVINE, LAZY-CSEQ and YOGAR-CBMC, three of them category winners.
The incorrect result is related to the use of pointer arithmetic, which is currently
not supported by our alias analysis.

Its main strength is also its main weakness: DARTAGNAN’s performance cannot
quite match that of other verifiers that were developed specifically for
sequential consistency. DARTAGNAN performs particularly poorly on benchmarks
with big atomic blocks. This is the case for most of the verification tasks in
the pthread-wmm group, which represent 83% of the ConcurrencySafety category.
The problem is that DARTAGNAN adds rmw edges for all pairs in an atomic
block. This results in a large encoding (even using relation analysis) and severely
impacts its performance.


3 Tool Setup and Configuration 


Besides the program to be verified, DARTAGNAN expects a CAT file containing
the memory model of interest. For SV-COMP’20, this is the extension of se-
quential consistency given in Figure 1. The tool is run by executing the following
command:

$ java -jar dartagnan/target/dartagnan-V-jar-with-dependencies.jar -cat <CAT file> -i <program file> [options]

Placeholder V is the tool version (currently 2.0.5) and options is used to config-
ure the unrolling bound, the alias analysis, and the fixpoint encoding. The full
list of options can be found on the project website (see Section 4).

To make sure not to miss a violation, the competition version of DARTAG- 
NAN implements an iterative approach. Initially, the bounded model checking 
algorithm is called with an unrolling bound of one. If it finds a violation or can 
prove that all loops have been unrolled completely (this is done using unwinding 
assertions), the verification process terminates with a conclusive answer. If not, 
DARTAGNAN increases the bound by one and repeats the process. For programs
with an infinite state space, our tool does not terminate.
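
The iterative scheme described above can be sketched as follows. This is a minimal illustration and not DARTAGNAN's implementation: the function names, the optional max_bound cut-off, and the two toy checkers (which stand in for the SMT-based bounded model checker and its unwinding assertions) are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical outcome of one bounded run: (violation_found, fully_unrolled).
BoundedResult = Tuple[bool, bool]


def iterative_bmc(check: Callable[[int], BoundedResult],
                  max_bound: Optional[int] = None) -> Tuple[str, int]:
    """Raise the unrolling bound until one bounded run is conclusive."""
    bound = 1
    while max_bound is None or bound <= max_bound:
        violation, fully_unrolled = check(bound)
        if violation:
            return "FAIL", bound      # a violation within `bound` unrollings
        if fully_unrolled:
            return "PASS", bound      # unwinding assertions hold: nothing left to unroll
        bound += 1                    # inconclusive: try a deeper unrolling
    return "UNKNOWN", bound - 1       # only reachable when max_bound is given


def loop_of_four(bound: int) -> BoundedResult:
    """Toy stand-in for the solver: a safe loop that needs four unrollings."""
    return False, bound >= 4


def bug_at_three(bound: int) -> BoundedResult:
    """Toy stand-in: a violation that becomes visible at bound three."""
    return bound >= 3, False


print(iterative_bmc(loop_of_four))   # ('PASS', 4)
print(iterative_bmc(bug_at_three))   # ('FAIL', 3)
```

Without max_bound, a checker that is never conclusive makes the loop run forever, which mirrors the non-termination on programs with an infinite state space.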

DARTAGNAN participates in the ConcurrencySafety category. No specification 
file is required. The artifact is available at [10]. To reproduce the results of the
competition, the tool can be executed with the following wrapper script: 


$ Dartagnan-SVCOMP.sh <program file> 


4 Software Project and Contributors 


The project home page is https://github.com/hernanponcedeleon/Dat3M. DARTAG-
NAN is open source software distributed under the MIT license.


Acknowledgement: We thank Dirk Beyer and Philipp Wendler for their help 
during the process of integrating DARTAGNAN into the competition framework. 
We also thank Natalia Gavrilenko for her contributions to the development of 
the bounded model checking engine of the tool [4,5].

Abstract. VeriAbs is a strategy-selection-based reachability verifier for C code. It ana-
lyzes the structure of loops and the intervals of inputs to choose one of the four verification
strategies implemented in VeriAbs. In this paper, we present VeriAbs version 1.4 with
updates in three strategies. We add an array verification technique called full-program
induction, and enhance the existing techniques of loop pruning, k-path interval analysis,
and disjunctive loop summarization. These changes have improved the verification of
programs with arrays, unstructured loops, and unstructured control flows.


1 Verification Approach 


VeriAbs is a reachability checker for C code that employs a portfolio of techniques and works 
by smartly selecting a sequence of techniques for each problem instance. Specifically, it 
performs structural and interval analysis of the input code to determine a sequence of suitable 
verification techniques, or a strategy [2]. An earlier version of the tool appeared in [9]. Figure 1 
shows the architecture with this year’s enhancements in dashed lines. When the input program 
contains unstructured loops, VeriAbs performs fuzz testing in parallel with k-induction. If the 
program does not contain unstructured loops but loops manipulating arrays, VeriAbs applies 
array abstraction techniques like loop shrinking, loop pruning, and full-program induction [7] 
in sequence. If the program contains inputs of very short ranges, VeriAbs applies explicit 
state model checking, and loop invariant generation using program behaviour, syntax and 
counter-examples in parallel [2]. Otherwise VeriAbs applies k-path interval analysis, loop 
abstraction, loop summarization, bounded model checking, and k-induction in the order pre- 
sented in the architecture. If any technique successfully (in)validates the encoded properties, 
the tool reports the result, generates the witness, and exits. We next explain the enhancements 
made to VeriAbs this year. 


1.1 Tool Enhancements


Full-Program Induction. VeriAbs applies full-program induction as presented in [7] to pro-
grams manipulating arrays of a symbolic size N given as a parameter. It takes as input
a parameterized program $P_N$, annotated with parameterized pre- and post-conditions
$\varphi(N)$ and $\psi(N)$ respectively, and checks the validity of the Hoare triple
$\{\varphi(N)\}\ P_N\ \{\psi(N)\}$ for all values of $N > 0$. We summarize the technique in [7] here.

In the base case, it verifies that the given Hoare triple holds for a fixed number of values
of N (say for $N = 1$). If the check fails, a property violation is reported. It then hypothesizes
that the Hoare triple $\{\varphi(N-1)\}\ P_{N-1}\ \{\psi(N-1)\}$ holds for $N > 1$, where $P_{N-1}$ is the
program with parameter $N-1$. In the induction step, the technique synthesizes a code
fragment $\partial P_N$, called the difference program, such that $\{\varphi(N)\}\ P_N\ \{\psi(N)\}$ is valid iff
$\{\varphi(N)\}\ P_{N-1};\ \partial P_N\ \{\psi(N)\}$ is valid. The difference program is the computation to be per-
formed after the program $P_{N-1}$ has executed to get the same state as $P_N$. It then computes a
formula $\partial\varphi(N)$, called the difference pre-condition, such that $\varphi(N)$ implies the conjunc-
tion of $\varphi(N-1)$ and $\partial\varphi(N)$, and that $\partial\varphi(N)$ continues to hold after the execution of $P_{N-1}$.
The induction step now needs to prove the validity of
$\{\psi(N-1) \wedge \partial\varphi(N)\}\ \partial P_N\ \{\psi(N)\}$.

Full-program induction does not rely on inductive invariants for each loop in the program.
It uses weakest pre-condition computation to infer formulas $pre(N)$ over the variables and
arrays whose values were computed by $P_{N-1}$ and subsequently read in $\partial P_N$. The base case is
checked for $pre(N)$, and it is subsequently used to strengthen the pre- and post-conditions
in the inductive step. The technique thus inducts over the entire program via the parameter
N, in place of inducting over individual loops by using specialized predicates as in [6].
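
The roles of the base case and of the difference program can be illustrated on a toy example. The sketch below is not VAJRA's algorithm: it only tests, for concrete values of N, that running the program for N-1 followed by a hand-written difference program reaches the same state as the program for N. The program, the post-condition, and all function names are assumptions chosen for the illustration, and the pre-condition is trivially true.

```python
def run_program(n):
    """P_N: fill an array of symbolic size n so that a[i] == i * i."""
    a = [0] * n
    for i in range(n):
        a[i] = i * i
    return a


def post(a, n):
    """psi(N): every cell holds the square of its index."""
    return len(a) == n and all(a[i] == i * i for i in range(n))


def difference_program(a_prev, n):
    """dP_N: work still to be done after P_{N-1} to reach the state of P_N."""
    return a_prev + [(n - 1) * (n - 1)]


def check_up_to(max_n):
    """Test the base case and the induction step for N = 2..max_n."""
    if not post(run_program(1), 1):                 # base case, N = 1
        return False
    for n in range(2, max_n + 1):
        prev = run_program(n - 1)
        if not post(prev, n - 1):                   # induction hypothesis psi(N-1)
            return False
        stepped = difference_program(prev, n)       # P_{N-1}; dP_N
        if stepped != run_program(n) or not post(stepped, n):
            return False
    return True


print(check_up_to(50))  # True
```

Testing finitely many values of N is of course weaker than the proof obligation discharged by the technique, which establishes the induction step for all N at once.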


k-Path Interval Analysis. VeriAbs implements a k-path interval analysis which is an
extension of the standard non-relational interval domain [2]. It maintains the path-wise data
ranges of variables along a configurable k number of paths at each program point, thus
matching the precision of relational domains. When the number of paths at the join point
exceeds k, a subset of paths are merged to maintain k paths at the join point. In previous
versions, arbitrary subsets of paths were merged. For SV-COMP 2020, the join operation
identifies variables of interest (VOIs) with respect to the given property to decide which
paths to merge such that VOIs can retain precise values.

    b=0, d=0, c=30;
    ...
    if (a == 10)
      c = 30;          //Path P1
    ...                //Path P2
    else if (a > 10)
      ...              //Path P3
    ...
    d = 31;
    if (a >= 10)
      assert(d == 31);

Fig. 2. Example


Consider the example shown in Figure 2 with a valid property at line 12 to be analyzed
with k=2 and the VOI a. It can be seen that three paths, P1, P2 and P3, join at line number 9.
The enhanced join operation merges paths P1 and P2 so that the resultant paths are as follows:

P1+P2: {a=[MIN,10], b=[0,3], c=30, d=0},

P3: {a=[11,MAX], b=0, c=30, d=31}.

This information at the join point helps validate the property. Earlier, the join operation could
merge the path P3 with P1 or P2, leading to an imprecise interval [0,31] of d at the join
point, resulting in a spurious property violation. Our implementation considers variables used
in the encoded property as the VOIs.
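
A VOI-guided join of this kind can be sketched as follows. The sketch is only an illustration under stated assumptions: the per-path intervals for P1, P2 and P3 are chosen to be consistent with the merged result given above, the VOIs are taken to be a and d, and the cost function (growth of the VOI intervals caused by a merge) is a hypothetical selection criterion, not necessarily the one implemented in VeriAbs.

```python
from itertools import combinations

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def hull(i1, i2):
    """Smallest interval containing both input intervals."""
    return (min(i1[0], i2[0]), max(i1[1], i2[1]))


def width(iv):
    return iv[1] - iv[0]


def merge(p, q):
    """Join two paths variable by variable in the interval domain."""
    return {v: hull(p[v], q[v]) for v in p}


def merge_cost(p, q, vois):
    """Hypothetical precision-loss measure: growth of the VOI intervals."""
    return sum(width(hull(p[v], q[v])) - max(width(p[v]), width(q[v]))
               for v in vois)


def join(paths, k, vois):
    """Merge the cheapest pair of paths until at most k paths remain."""
    paths = dict(paths)
    while len(paths) > k:
        n1, n2 = min(combinations(paths, 2),
                     key=lambda pr: merge_cost(paths[pr[0]], paths[pr[1]], vois))
        paths[n1 + "+" + n2] = merge(paths.pop(n1), paths.pop(n2))
    return paths


paths = {
    "P1": {"a": (10, 10), "b": (0, 0), "c": (30, 30), "d": (0, 0)},
    "P2": {"a": (INT_MIN, 9), "b": (3, 3), "c": (30, 30), "d": (0, 0)},
    "P3": {"a": (11, INT_MAX), "b": (0, 0), "c": (30, 30), "d": (31, 31)},
}
for name, state in join(paths, k=2, vois=["a", "d"]).items():
    print(name, state)
```

With these inputs, merging P1 and P2 leaves d exact and widens a by a single value, whereas any merge involving P3 would widen d to [0,31], so P1 and P2 are merged and P3 is kept separate.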

Loop Pruning is an array abstraction technique that defines a set of criteria (and a
resulting set of program transformation rules) such that, if loops processing arrays satisfy
them, it is sufficient to analyze the first few elements instead of the entire array [14]. In this
version, pruning has been extended to programs containing nested loops and multidimensional
arrays. By structural analysis, we identify if elements of the multidimensional array are
processed uniformly in loops. If yes, we compute reduced dimensions of the array (for example,
a[m][m] may be reduced to a[4][4]). We have also refined the pruning criteria to improve
its applicability over multidimensional and dynamically allocated arrays. The current
implementation of array pruning solves 56 more SV-COMP'20 ReachSafety benchmarks
than the previous version.

Disjunctive Loop Summarization. VeriAbs analyses interleavings of unique paths within a 
loop to produce its disjunctive summary to find errors and proofs [2]. In the current version, 
VeriAbs extends this technique in the following situations: (a) while it earlier restricted affine 
transformations to identity matrices, we now allow diagonal matrices with finite monoid [4]; 
(b) we use the approach of generating flattenings as shown in [4] for loops which are flattable; 
(c) we use VeriAbs' general philosophy of deriving over-approximate summaries using the 
techniques in [12], when a precise disjunctive summary is not derivable.


2 Software Architecture 


VeriAbs is primarily developed in Java and Perl. It implements all program analyses (except 
full-program induction) and program transformers in Prism [13], the TCS Research program 
analysis framework. It transforms programs processing multidimensional or dynamically 
allocated arrays in loops to equivalent programs with symbolically sized 1D arrays. This 
transformed program is consumed by VAJRA v1.0 [7], the tool that implements full-program 
induction. VAJRA uses LLVM v6.0.0 [15] compiler infrastructure for program transformations 
and Z3 SMT solver v4.8.7 [10] for checking the validity of Hoare triples and for computing 
weakest pre-conditions. For BMC VeriAbs uses the C Bounded Model Checker (CBMC) 
v5.10 [8] with the Glucose Syrup SAT solver v4.0 [3]. For fuzz testing we enhance American 
Fuzzy Lop [16] to allow test case mutation within valid data ranges generated by k-path 
interval analysis for better path coverage. VeriAbs uses k-induction with continuously refined 
invariants as implemented in CPAchecker v1.8 [5] for an improved precision over our existing 
lightweight implementation of k-induction.

In this version, we additionally derive disjunctive invariants for correctness witnesses 
using abstract acceleration and abstract interpretation, and add them to the control flow 
automaton generated by CPAchecker. If all implemented techniques fail, we use techniques 
implemented in Ultimate Automizer v3204b741 [11] to generate correctness witnesses. 


3 Strengths and Weaknesses 


The main strengths of VeriAbs are (1) strategy selection that correlates strengths of verification
techniques and input code properties, and (2) a portfolio of sound techniques. Weaknesses:
(1) Long strategies: in the worst case, a strategy executed by VeriAbs comprises ten
techniques and is thus time consuming. Hence, smarter and shorter strategies are needed. (2) Non-
linear expressions in loops: loop abstractions in VeriAbs assign non-deterministic values to
variables modified in such expressions. (3) Multidimensional arrays in loops manipulating non-
contiguous locations: these are limitations of loop shrinking and pruning. These weaknesses
are not limitations of the state-of-the-art, and appropriate techniques if integrated into VeriAbs
can be easily invoked by the strategy selector to enable verification of such programs.


4 Tool Setup and Configuration 


The VeriAbs SV-COMP 2020 executable is available for download at
https://gitlab.com/sosy-lab/sv-comp/archives-2019/tree/master/2020/veriabs.zip. To install
the tool, download the archive, extract its contents, and then follow the installation
instructions in VeriAbs/INSTALL.txt. To execute VeriAbs, the user needs to specify the
property file of the respective verification category using the --property-file option and
the -64 option for programs with a 64 bit architecture. The witness is generated in the
current working directory as witness.graphml. A sample command is as follows:

VeriAbs/scripts/veriabs «-64» --property-file ALL.prp example.c

VeriAbs participated in the ReachSafety and the SoftwareSystems-ReachSafety categories
of SV-COMP 2020. The BenchExec wrapper script for the tool is veriabs.py and the
benchmark description file is veriabs.xml.


5 Software Project and Contributors 


VeriAbs is maintained by some members of the Foundations of Computing group at TCS Re- 
search [1]. They can be contacted at veriabs.tool@tcs.com. We are thankful to the developers 
of American Fuzzy Lop, CBMC, CPAchecker, Glucose Syrup, LLVM, UAutomizer and Z3 
for allowing us to use the tools within VeriAbs.

Abstract. GACAL verifies C programs by searching over the space of
possible invariants, using traces of the input program to identify poten- 
tial invariants. GACAL uses the ACL2s theorem prover to verify these 
potential invariants, using an interface provided by ACL2s for connecting 
with external tools. GACAL iteratively searches for and proves invariants 
of increasing complexity until the program is verified. 


1 Verification Approach 


GACAL is a tool for verifying reachability queries in C programs by iteratively 
and efficiently performing conjecture generation and conjecture verification. Con- 
jecture generation involves searching through the space of possible conjectures 
using evaluation-based testing to identify likely-to-hold conjectures, and conjec- 
ture verification consists of using software verification technology to verify these 
conjectures. Our initial motivation was to develop a computational agent that 
can automatically complete the Invariant Game [1], in which players suggest in- 
variants that are used by a reasoning engine to verify imperative programs, which 
we did with success; GACAL is a more fully developed form of the underlying
conjecture generation ideas. This section presents a brief overview of GACAL's 
basic structure and methods for conjecture-based verification, and then discusses 
these, as well as associated challenges, in more depth. Section 2 provides infor- 
mation about the GACAL project, Section 3 provides an evaluation of GACAL, 
and Section 4 concludes this paper and discusses future work. 

In GACAL, conjectures are potential invariants paired with program loca- 
tions. Evaluation-based testing consists of evaluating possible invariants using 
execution-produced program traces. The ACL2s theorem prover [2] verifies con- 
jectures using a graph representation of the input program. To search through 
the space of conjectures, GACAL first constructs a space of terms, which are 
C-expressions composed of the constants, variables, and arithmetic/bitwise oper- 
ators in the program. Terms are combined using relational and logical operators 
to create possible invariants, and possible invariants which hold in all generated 
program traces are promoted to potential invariants and turned into conjectures. 
Discovered potential invariants are then analyzed using ACL2s and, if proven, 
used to verify the program. In the case that the program cannot be verified from 
the currently proven invariants, the above process is repeated: construct new, 
more complex, terms, find potential invariants via testing on program traces,
prove potential invariants, attempt program verification, and repeat. At a high
level, this loop is the heart of GACAL's conjecture-based verification.

GACAL's approach to verification presents challenges which can be summa- 
rized into two categories: how to minimize the number of generated conjectures, 
and how to optimize the interactions with ACL2s. The techniques GACAL uses 
to address these challenges, as well as a more in-depth explanation of the previ-
ously mentioned methods, are outlined below.


Term and Invariant Construction GACAL builds the space of terms by 
iteratively constructing all terms of a fixed size, where the size of a term is the 
number of constants, variables, and operators in that term. GACAL uses a col- 
lection of rewrite rules to filter the newly constructed terms: terms which can be 
rewritten to an equivalent form that has already been constructed are not kept. 
The size partial order on terms allows GACAL to perform rewriting effectively.
Furthermore, the term constructor searches for new rewrite rules by evaluating 
and comparing terms under a set of random assignments to find pairs of equiva- 
lent terms. The discovered equivalences are generalized and turned into rewrite 
rules which are added to the collection of rewrite rules. We designed the rewrit- 
ing techniques to have the property that all terms which cannot be rewritten 
are semantically distinct. In general, the term space is at least asymptotically 
exponential in size, and the rewriting techniques above, for the class of problems 
we consider, significantly improve the asymptotics. 
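
The size-based construction of terms can be sketched as follows. This is a simplified stand-in rather than GACAL's method: instead of learning and applying rewrite rules, the sketch discards every term whose values under a set of random assignments coincide with those of a term that was kept earlier. The term representation, the operator set, and the function names are assumptions, and agreement on random samples only suggests equivalence; it does not prove it.

```python
import random

MASK = 0xFFFFFFFF  # results are reduced modulo 2**32, like 32-bit unsigned ints
OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
}


def evaluate(term, env):
    """Evaluate a term: an int constant, a variable name, or (op, left, right)."""
    if isinstance(term, tuple):
        op, left, right = term
        return OPS[op](evaluate(left, env), evaluate(right, env)) & MASK
    if isinstance(term, str):
        return env[term]
    return term & MASK


def build_terms(atoms, max_size, samples=16, seed=0):
    """Enumerate terms by size, keeping one term per observed behaviour."""
    rng = random.Random(seed)
    variables = [a for a in atoms if isinstance(a, str)]
    envs = [{v: rng.randrange(MASK + 1) for v in variables}
            for _ in range(samples)]
    seen = set()        # fingerprints of all terms kept so far
    by_size = {}        # size -> kept terms of exactly that size
    for size in range(1, max_size + 1):
        if size == 1:
            candidates = list(atoms)
        else:
            # size(op(l, r)) = 1 + size(l) + size(r)
            candidates = [(op, l, r)
                          for left_size in range(1, size - 1)
                          for l in by_size[left_size]
                          for r in by_size[size - 1 - left_size]
                          for op in OPS]
        kept = []
        for term in candidates:
            fingerprint = tuple(evaluate(term, env) for env in envs)
            if fingerprint not in seen:   # no smaller or earlier twin exists
                seen.add(fingerprint)
                kept.append(term)
        by_size[size] = kept
    return by_size


terms = build_terms(["x", "y", 1], max_size=3)
print(len(terms[3]), "of 27 size-3 candidates kept")
```

Because terms are generated in order of increasing size, a term such as x * 1 is dropped in favour of the smaller x, and y + x is dropped in favour of the earlier x + y.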

Possible invariants are C-expressions of the form x == y, x < y, x <= y,
and P || Q, where x, y are terms and P, Q are possible invariants. We allow
multiple invariants to be associated with each program location, hence, we do
not need explicit conjunction. We note that the space of possible invariants is
closed under logical negation. GACAL filters out possible invariants which can
be rewritten to an equivalent form that has already been created, reducing the
size of the invariant space. The order in which the invariant space is searched is
deterministic and independent of the given program, and was chosen because it
worked well for the benchmark programs. At a high level, GACAL inspects more
specific invariants before more general invariants (e.g. x == y before x <= y).
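
The promotion of possible invariants to potential invariants can be sketched as follows. The sketch covers only atomic candidates over variables and constants (disjunctions and compound terms are omitted), and the enumeration order, the data representation, and the function names are assumptions made for the illustration.

```python
from itertools import combinations, permutations


def value(term, state):
    """A term is either an integer constant or a variable name."""
    return state[term] if isinstance(term, str) else term


def holds(inv, state):
    rel, x, y = inv
    vx, vy = value(x, state), value(y, state)
    return {"==": vx == vy, "<": vx < vy, "<=": vx <= vy}[rel]


def possible_invariants(terms):
    """Enumerate atomic candidates, more specific relations first."""
    for x, y in combinations(terms, 2):      # == is symmetric
        yield ("==", x, y)
    for rel in ("<", "<="):
        for x, y in permutations(terms, 2):
            yield (rel, x, y)


def potential_invariants(terms, states):
    """Keep the candidates that hold in every recorded state."""
    return [inv for inv in possible_invariants(terms)
            if all(holds(inv, s) for s in states)]


# States observed at the head of `i = 0; while (i < n) i++;` for n = 3.
states = [{"i": i, "n": 3} for i in range(4)]
print(potential_invariants(["i", "n", 0], states))
# [('<', 0, 'n'), ('<=', 'i', 'n'), ('<=', 0, 'i'), ('<=', 0, 'n')]
```

Note that 0 < n survives only because every recorded trace uses n = 3; it is a potential invariant that a later proof attempt would have to confirm or reject.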


Trace Generation To produce traces through the program, GACAL creates
many initial program states which randomly seed the result of all nondetermin-
istic behaviors that occur during execution of the program, making them de-
terministic. For example, a seeded pseudo-random number generator can obtain
values for ‘nondeterministic integer’ expressions. The initial states are propa-
gated through the program for a bounded number of steps, generating a set
of states associated with each program location. These initial traces are not
changed during the course of verification.
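
Seeded trace generation can be sketched as follows. The toy program, the location names, and the function names are assumptions; GACAL propagates states through its own graph representation of the input program, whereas the sketch hard-codes a single loop.

```python
import random
from collections import defaultdict


def trace_toy_program(seed, max_steps=100):
    """Trace `n = nondet() % 8; i = 0; while (i < n) i++;` for one seed."""
    rng = random.Random(seed)       # the seed fixes every nondeterministic value
    states = defaultdict(list)      # program location -> states observed there
    n = rng.randrange(2**32) % 8    # stand-in for a nondeterministic unsigned int
    i = 0
    for _ in range(max_steps):      # bounded propagation
        states["loop_head"].append({"i": i, "n": n})
        if not i < n:
            states["exit"].append({"i": i, "n": n})
            break
        i += 1
    return states


def generate_traces(num_seeds, max_steps=100):
    """Collect the states of several seeded runs per program location."""
    merged = defaultdict(list)
    for seed in range(num_seeds):
        for location, observed in trace_toy_program(seed, max_steps).items():
            merged[location].extend(observed)
    return merged


traces = generate_traces(num_seeds=5)
print({location: len(observed) for location, observed in traces.items()})
```

Running the same seed twice yields the same trace, and a small max_steps leaves locations such as the loop exit without any recorded state, which is the situation the additional trace types described next are meant to address.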

Testing on program traces is essential to GACAL’s conjecture generation, but 
programs may, for example, contain loops with many iterations or not terminate, 
and so obtaining traces which correspond to complete program executions may 
be computationally infeasible or impossible. To address this, GACAL creates
additional types of traces which approximate the input program’s behavior. The
first type of these traces generalizes large constants to small and/or nondetermin- 
istic values, which allows loops with originally many iterations to be completed. 
The second type uses the counter-example generation abilities of ACL2s [3,4,5] 
to generate states at any program location which satisfy all currently proven 
invariants at that location, which are then propagated through the program. 
As GACAL proves more invariants, it recomputes the second type of traces to 
obtain a better approximation of the program. Since invariants tested on these 
traces are later checked for correctness, the fact that the traces may not reflect 
the original program’s behavior does not introduce unsoundness. The states from 
the above two methods are only used to test invariants at a program location 
if there are no states from the original traces produced for that location, and if 
traces cannot be found at all then GACAL assumes all invariants are potential. 


Conjecture Verification To prove conjectures, GACAL uses an algorithm 
which takes previously proven invariants as well as currently unproven potential 
invariants and iteratively removes invariants which cannot be proven until it 
reaches a fixpoint. This process requires a large number of verification queries 
and for the majority of programs checking these queries using ACL2s is where 
the majority of execution time is spent. To improve the ability of ACL2s to 
reason about GACAL queries, we developed an arithmetic library consisting 
of ACL2s theorems about the GACAL-supported C operators. Additionally, 
GACAL caches previous queries and their results, which allows it to answer 
queries that are similar to cached queries, without using the theorem prover. 
Finally, GACAL saves counter-examples that ACL2s provides when it falsifies 
queries and uses them to falsify new queries. 
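
The fixpoint computation over candidate invariants can be sketched as follows. The sketch is an illustration under strong assumptions: an exhaustive check over a small finite state space stands in for the ACL2s queries, the program is a single toy loop, and query caching and counter-example reuse are omitted. All names are hypothetical.

```python
def prove_fixpoint(candidates, init_states, step, universe):
    """Drop unprovable candidates until the remaining set is inductive.

    `candidates` maps a name to a predicate over states. The exhaustive
    check over `universe` is a stand-in for a theorem prover query.
    """
    # Initiation: a candidate must hold in every initial state.
    current = {name: pred for name, pred in candidates.items()
               if all(pred(s) for s in init_states)}
    changed = True
    while changed:
        changed = False
        # States that satisfy every candidate still assumed to be invariant.
        assumed = [s for s in universe
                   if all(pred(s) for pred in current.values())]
        for name, pred in list(current.items()):
            if not all(pred(step(s)) for s in assumed):
                del current[name]   # not preserved: remove and start over
                changed = True
                break
    return set(current)


# Toy loop `i = 0; while (i < n) i++;` over states (i, n) with 0 <= i, n <= 5.
universe = [(i, n) for i in range(6) for n in range(6)]
init_states = [(0, n) for n in range(6)]


def step(state):
    i, n = state
    return (i + 1, n) if i < n else state


candidates = {
    "i <= n": lambda s: s[0] <= s[1],
    "i == 0": lambda s: s[0] == 0,
    "i < n": lambda s: s[0] < s[1],
}
print(prove_fixpoint(candidates, init_states, step, universe))  # {'i <= n'}
```

Here i < n is rejected because it fails initially for n = 0, and i == 0 is rejected because the loop body does not preserve it; removing a candidate weakens the assumptions for the others, which is why the check is repeated until nothing changes.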


2 Tool Setup and Software Project and Architecture 


The competition submission¹ uses GACAL version 1.0. GACAL requires Python
3, Java, and Common Lisp, and the competition archive contains all files neces-
sary to run GACAL without further installation. Other relevant information may
be found in the README file. GACAL only competes in the C ReachSafety-
Loops category. GACAL is maintained by Benjamin Quiring and Panagiotis
Manolios, and is implemented primarily in Common Lisp. The external tools
used by GACAL are the Eclipse CDT parser and the ACL2 Sedan [2]. GACAL
is publicly available at https://gitlab.com/acl2s/conjecture-generation/gacal un-
der a GNU GPLv3 license.

GACAL does not handle all C language features. Most importantly, GACAL
does not handle arrays and types other than 32-bit unsigned and signed integers.
There is no theoretical reason for this. GACAL does not correctly model C
semantics for undefined behavior in signed arithmetic. There is a bug in the
contest submission for translating goto statements into our graph representation
of programs which affects a small number of benchmarks.


¹ Available at https://gitlab.com/sosy-lab/sv-comp/archives-2020 and Zenodo [6].


3 Evaluation


GACAL performs best on programs it can execute to completion because this 
allows us to produce high quality traces covering all program locations. When 
this is not the case, GACAL often creates false conjectures which lead to a large 
number of theorem prover queries. Additionally, we note GACAL’s execution 
time depends on the size of the term and invariant spaces, which grow exponen- 
tially based on the number of program variables, constants, and operations. The 
current version of GACAL verifies 66 of the 109 benchmark programs it parses, 
and the top three tools on this distribution verified 102, 70, and 70. There was 
one program which no other tools could verify, though GACAL succeeded. 

The core of GACAL consists of potential invariant generation using program 
traces and the rewriting methods as outlined above. We found that the addition 
of the arithmetic library is essential to our ability to reason about unsigned 
arithmetic and the mod operator, allowing GACAL to verify 10% more total 
programs (which deal primarily with the listed features) and cutting the average
time to query ACL2s by 33% on the verification queries which were not caught 
by the caching. We found that the additional trace generation methods did not 
significantly increase the number of programs that were verified, though they 
did decrease the average time for verifying a program. The caching of proof 
results and counter-examples is able to eliminate 85% of all verification queries 
from being submitted to ACL2s for checking, which increases the number of 
programs which are verified by over 10% and almost halves the average cost to 
verify a program. The caching methods also amplify the benefits of the library
and extra trace generation methods. 


4 Conclusions and Future Work 


There are many ways to improve GACAL, including incorporating classical anal-
yses such as range analysis, abstract interpretation, symbolic evaluation, etc.,
as well as handling a larger subset of the C language. Another improvement 
to GACAL is to perform the search for disjunctive invariants more efficiently; 
currently GACAL often finds many potential but false disjunctive conjectures, 
which result in a large number of verification queries. One way to improve the 
search may be to analyze the program to find meaningful hypotheses, which 
could considerably lower the number of tested and generated conjectures. 

We believe that GACAL provides evidence that our conjecture-based verifi- 
cation techniques can be used to improve current software verification tools, as 
we were able to verify a competitive number of programs on the distribution we 
parse and we were able to verify a program that all other tools failed to verify, 
despite not using any of the classical analyses identified above.

Abstract. Path-merging is a known technique for accelerating symbolic
execution. One technique, named “veritesting” by Avgerinos et al., uses
summaries of bounded control-flow regions and has been shown to accel-
erate symbolic execution of binary code. But, when applied to symbolic
execution of Java code, veritesting needs to be extended to summarize
dynamically dispatched methods and exceptional control-flow. Such an
extension of veritesting has been implemented in Java Ranger, which is
built as an extension of Symbolic PathFinder, a symbolic executor
for Java bytecode. In this paper, we briefly describe the architecture of
Java Ranger and describe its setup for SV-COMP 2020.


1 Approach 


Symbolic execution is a well-known program analysis technique that has been 
applied to many applications such as test generation [3,7], equivalence check- 
ing [6,8], and vulnerability finding [13]. However, when applied to large soft- 
ware, symbolic execution can suffer from scalability challenges caused by path 
explosion. Path-merging techniques such as veritesting [1] and dynamic state 
merging [4] help alleviate these scalability limitations. In particular, veritest- 
ing attempts to construct a static summary of a multi-path region and use it. 
Veritesting has been shown to significantly accelerate symbolic execution of bi- 
nary code. Given that a large amount of software in use today is still written in 
Java, it is desirable to bring the benefits of veritesting to symbolic execution of 
Java as well. However, features such as dynamic dispatch make path-merging for 
Java code challenging [11]. The summary of a multi-path region that contains 
a dynamically-dispatched method call can only be constructed if the method to 
be called can also be summarized. Java Ranger (JR) extends the current state- 
of-the-art path-merging ideas presented by Avgerinos et al. [1] by first building 
static summaries which are later transformed using runtime information such as
the dynamic type of an object reference used for accessing a field. Java Ranger
is built as an extension to Symbolic PathFinder (SPF) [5]. 


2 Architecture 


Java Ranger is implemented as an SPF listener that watches for symbolic branch 
conditions in branching instructions. On encountering a symbolic branch instruc- 
tion, JR attempts to create a summary for the multi-path region that begins at 
that branch instruction and ends at its exit points. A multi-path region is a 
region of code that begins at a branch instruction with a symbolic branch condi- 
tion. An exit point of a multi-path region is either (1) the first program location 
in a control-flow path through the multi-path region which could not be sum- 
marized, or (2) the location of the immediate post-dominator of the multi-path 
region. This mechanism is also explained by Sharma et al. [12] in Figure 4. 
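
The effect of summarizing a multi-path region can be illustrated with a small sketch. It is not Java Ranger's implementation: regions are reduced to simple if-then-else assignments with a single exit point, symbolic values are plain strings, and the function names are assumptions. The sketch only contrasts the number of states produced with and without merging.

```python
def fork_execute(branches):
    """Plain symbolic execution: every branch doubles the number of paths."""
    states = [([], {})]                      # (path condition, symbolic store)
    for cond, var, then_val, else_val in branches:
        forked = []
        for path_condition, store in states:
            forked.append((path_condition + [cond], {**store, var: then_val}))
            forked.append((path_condition + [f"!({cond})"], {**store, var: else_val}))
        states = forked
    return states


def merge_execute(branches):
    """Path-merging: summarize each region as an if-then-else expression."""
    store = {}
    for cond, var, then_val, else_val in branches:
        store[var] = f"ite({cond}, {then_val}, {else_val})"
    return [([], store)]                     # a single state, no forking


# Three regions of the shape `if (xk > 0) yk = 1; else yk = 2;`.
branches = [(f"x{k} > 0", f"y{k}", "1", "2") for k in range(3)]
print(len(fork_execute(branches)), len(merge_execute(branches)))  # 8 1
print(merge_execute(branches)[0][1]["y0"])  # ite(x0 > 0, 1, 2)
```

Merging trades the exponential number of paths for larger expressions that the constraint solver has to handle, which is why the benefit depends on how many regions can actually be summarized.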


3 Strengths And Weaknesses 


Since JR addresses the scalability limitations of symbolic execution, its strength can
only be observed when running it over large software. However, JR falls back 
to vanilla symbolic execution when it finds no opportunity for path-merging. 
SV-COMP 2020 had 416 verification tasks in the Java track. More information 
on SV-COMP 2020 can be found in its competition report [2]. JR instantiated 
at least one static summary on 96 different benchmarks of the 416 benchmarks. 
The summary for a multi-path region can be instantiated more than once on 
each benchmark because it is possible that the symbolic executor will encounter 
the same multi-path region more than once while running the benchmark. In 
total, JR instantiated 356 unique summaries. The total number of instantiated 
summaries used by JR was 20,182. JR also inlined a method summary a total 
of 62,857 times while instantiating these summaries. 

JR also had an “unknown” conclusion on 40 of the 416 SV-COMP 2020 verifi-
cation tasks. 22 of the 40 were caused by our JR configuration, which turned
off support for symbolic strings because we found SPF’s support for solving 
string constraints was not stable. 9 “unknown” conclusions were reached due to 
missing support for symbolic array lengths in multi-dimensional arrays. 8 of the 
40 occurred due to a timeout. The last “unknown” result occurs in the equiva- 
lence check verification task in the ApacheCLI benchmark due to JR’s use of a 
depth limit. 

We made use of two depth limit parameters in SV-COMP 2020. The first 
was a limit on the exploration depth of our baseline symbolic executor, SPF. 
The second was a depth limit on the recursive depth to which our method 
summaries would be inlined. While we wished to avoid the use of any such limit, 
we found similar kinds of limits were used by many participating tools in SV-
COMP 2019. It is common to use some kind of limitation when applying symbolic
execution tools in practice, since they can get bogged down by path explosion or 
related problems, and path-merging helps with but does not eliminate this issue.
The Java verification category of SV-COMP 2020 did not score a tool’s answer
differently if it used a depth limit for producing that answer. Instead, the use of 
depth limit is reflected in each tool’s score only if it caused the tool to produce an 
incorrect answer. We describe these depth limits and JR’s configuration options 
in the following section. 


4 Tool Setup and Configuration 


Java Ranger’s setup is very similar to the setup used by SPF. Since Java Ranger 
is simply an extension of SPF, the Java Ranger directory can be specified as 
a valid jpf-symbc extension of JPF. A JR configuration requires the following 
additions. 
veritestingMode = «1-5» 
veritestingMode specifies the path-merging features to be enabled with each 
higher number adding a new feature to the set of features enabled by the previous 
number. Setting veritestingMode to 1 runs vanilla SPF. Setting it to 2 enables 
path-merging for multi-path regions with no method calls and a single exit point. 
Setting it to 3 adds path-merging for multi-path regions that make method calls 
where the method can be summarized by Java Ranger. Setting it to 4 adds path- 
merging for multi-path regions with more than one exit point caused due to 
exceptional behavior and unsummarized method calls. Setting it to 5 adds path- 
merging for summarizing return instructions in multi-path regions by treating 
them as an additional exit point. 

performanceMode = «true or false»
Setting performanceMode to true causes Java Ranger to minimize the number 
of solver calls to check the feasibility of the path condition when summarizing a 
multi-path region with multiple exit points. 

TARGET_CLASSPATH_WALA=«classpath of target code»
Java Ranger needs this variable to be set up as an environment variable. It is not
part of the .jpf configuration file. This environment variable tells Java Ranger 
where it should be expecting to find code that needs to be statically summarized. 

jitAnalysis = «true or false»
When turned on (the default value), this option causes JR to summarize
multi-path regions when it encounters them. When turned off, JR attempts to
summarize all multi-path regions reachable in a statically-computed interprocedural
call graph up to a configurable limit.

recursiveDepth = «an integer value»
This option forces JR to restrict inlining of method summaries up to the value
provided for this option. We set this parameter to 12 for SV-COMP 2020.

The following option is a JPF [14] configuration option which we also used 
for SV-COMP 2020. 

search.depth_limit = «an integer value»
This option forces JPF to restrict its exploration to the depth provided as the
value for this option. JPF constructs a tree of possible choices and explores the
tree in a heuristic order, depth-first by default. Since JR is built as an extension
to SPF, which is in turn built as an extension to JPF, we were able to restrict
JR’s exploration of choices using this option. We set this parameter to the value
13 for SV-COMP 2020.
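
The cumulative meaning of veritestingMode described above can be modelled in a few lines. The following Python sketch is purely illustrative: the feature labels paraphrase the text, and the names FEATURES_BY_MODE and enabled_features are hypothetical, not identifiers from Java Ranger.

```python
from typing import List

# Illustrative model of the cumulative veritestingMode levels described above.
# The labels paraphrase the text; they are not Java Ranger identifiers.
FEATURES_BY_MODE = {
    2: "merge multi-path regions with no method calls and a single exit point",
    3: "merge regions that call methods Java Ranger can summarize",
    4: "merge regions with multiple exit points (exceptions, unsummarized calls)",
    5: "summarize return instructions as an additional exit point",
}


def enabled_features(veritesting_mode: int) -> List[str]:
    """Return the path-merging features active for a mode (1 = vanilla SPF)."""
    if not 1 <= veritesting_mode <= 5:
        raise ValueError("veritestingMode must be between 1 and 5")
    # Each level keeps every feature of the lower levels and adds one more.
    return [FEATURES_BY_MODE[m] for m in range(2, veritesting_mode + 1)]
```

Under this model, mode 1 yields an empty feature list, and mode 5 yields all four features.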


5 Software Project and Contributors 


Java Ranger is an extension of SPF. It is maintained on GitHub [9]. The version 
of Java Ranger that participated in SV-COMP 2020 is publicly available [10].
For more information, please contact the authors of this paper.


Abstract. JDART performs dynamic symbolic execution of JAVA programs:
it executes programs with concrete inputs while recording symbolic
constraints on executed program paths. A constraint solver is then
used for generating new concrete values from recorded constraints that
drive execution along previously unexplored paths. JDART is built on
top of the Java PathFinder software model checker and uses the
JCONSTRAINTS library for the integration of constraint solvers.


1 Overview 


JDART is a dynamic symbolic execution engine for the JVM built on top of
Java PathFinder (JPF) [11]. Dynamic symbolic execution [4,6] (sometimes also 
referred to as concolic execution) executes programs with concrete values while 
recording symbolic constraints for execution paths. The approach combines the 
benefits of fast concrete execution with the possibility of generating new concrete 
values, triggered by symbolic constraints, that exercise previously unexplored 
program behaviors. JDART can be used for checking assertions in Java programs: 
Concolic execution will explore new program paths until either (a) an assertion 
violation is discovered, (b) all program paths have been explored, or (c) resource 
limits of the analysis are exhausted. 
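
The exploration loop described above can be sketched on a toy scale. The Python code below is a simplified illustration and not JDART's implementation: branch conditions are recorded as Python callables instead of symbolic expressions, a brute-force search over a small integer domain stands in for the constraint solver, and all names (program, solve, concolic_explore) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Branch = Tuple[Callable[[int], bool], bool]  # (condition, outcome taken)


def program(x: int, trace: List[Branch]) -> None:
    """Toy program under test; every branch records its condition and outcome."""
    def branch(cond: Callable[[int], bool]) -> bool:
        taken = cond(x)
        trace.append((cond, taken))
        return taken

    if branch(lambda v: v > 10):
        if branch(lambda v: v % 7 == 0):
            raise AssertionError("assertion violation")


def solve(constraints: List[Branch], domain=range(-50, 51)) -> Optional[int]:
    """Stand-in for a constraint solver: brute-force search over a small domain."""
    for candidate in domain:
        if all(cond(candidate) == want for cond, want in constraints):
            return candidate
    return None


def concolic_explore(start: int = 0, max_runs: int = 20) -> Optional[int]:
    """Return an input that violates an assertion, or None if none was found."""
    worklist, seen_paths = [start], set()
    for _ in range(max_runs):
        if not worklist:
            return None  # (b) every discovered path has been explored
        x = worklist.pop()
        trace: List[Branch] = []
        try:
            program(x, trace)  # concrete run that records path constraints
        except AssertionError:
            return x  # (a) assertion violation discovered
        path = tuple(taken for _, taken in trace)
        if path in seen_paths:
            continue
        seen_paths.add(path)
        # Negate each recorded decision in turn to target unexplored paths.
        for i in range(len(trace)):
            flipped = trace[:i] + [(trace[i][0], not trace[i][1])]
            new_input = solve(flipped)
            if new_input is not None:
                worklist.append(new_input)
    return None  # (c) resource limit (number of runs) exhausted
```

Starting from the input 0, this sketch first generates 11 to enter the outer branch and then 14 to reach the violation.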

The initial driver of the development of JDART was the need for an analysis 
that is robust enough to handle large and complex systems, concretely the AU- 
TORESOLVER software for prediction and resolution of airplane loss of separation 
developed at NASA Ames Research Center [7]. Though JDART provides a robust 
and scalable platform for dynamic symbolic analysis of JAVA programs [7], we 
had to extend its functionality in several ways in order to be able to compete at 
SV-COMP 2020 [1]. We developed: 


1. a new analysis mode in which fresh symbolic variables are introduced during 
analysis (in contrast to a fixed number of manually declared symbolic values), 

2. a number of symbolic models encoding environment behavior (driven by 
SV-COMP 2020 benchmarks), and 

3. a new mode for solving constraints in a sequence of attempts using succes- 
sively weaker bounds on variables (cf. Section 2). 


© The Author(s) 2020 
A. Biere and D. Parker (Eds.): TACAS 2020, LNCS 12079, pp. 398-402, 2020.
https://doi.org/10.1007/978-3-030-45237-7_28


JDART: Dynamic Symbolic Execution for Java Bytecode


[Figure 1 not reproduced. Recoverable labels: Extensions, Method Summarizer, Test Suite Generator, Valuation, Constraints.]


Fig. 1: Architecture of JDART [7].


While (1) enabled JDART to enter the competition, (2) accounts for the largest 
part of improvements over our own baseline, and (3) contributes to better per- 
formance on some benchmarks with assertion violations in big state spaces. 


2 Architecture 


JDART combines dynamic execution with recording and analysis of symbolic 
path constraints. It runs as an extension of the JPF software model checker [11]. 
In particular, JDART uses the JAVA virtual machine implemented by JPF and 
its capabilities for annotating values on the stack and the heap with symbolic 
information. The tool itself is written in JAVA and uses JCONSTRAINTS [5] for 
encoding SMT problems. Moreover, JCONSTRAINTS acts as a frontend to an 
SMT solver (e.g., Z3 [3]) used for finding concrete values that drive the analysis. 

Figure 1 illustrates the architecture of JDART. The tool consists of three layers.
Concrete analysis frontends make up the top layer (e.g., generation of method
summaries, generation of test suites, assertion checking). The main components 
record and analyze execution paths (Explorer) and perform concolic execution 
(Executor). The Executor uses concolic implementations of bytecode instruc- 
tions. These bytecodes are executed instead of the original JPF bytecodes. A 
concolic bytecode tracks the symbolic representation of a value and annotates 
a concrete value with its symbolic counterpart. Whenever execution takes a
branching decision based on a concrete value with a symbolic annotation, the
symbolic value is added to the constraints tree maintained by the Explorer. A 
constraint solver is used for finding concrete values that drive execution along 
unexplored paths of the tree. 

Leveraging the modular architecture of JDART and JCONSTRAINTS, we implemented
a meta-constraint solver for finding small concrete values for symbolic
numeric variables. This allows JDART to find assertion violations faster and with
less resource consumption in cases where a symbolic variable controls the number
or length of execution paths (e.g., a symbolic array size or a symbolic loop bound).
The meta-constraint solver performs multiple calls to an SMT solver, adding
successively weaker bounds on numeric variables. For example, for a path constraint p
over a symbolic numeric variable x, the solver adds the bounds (−z ≤ x) ∧ (x ≤ z) with
z ∈ (1, 2, 3, 5, 8, 13, 21, ...), i.e., the first numbers in the Fibonacci sequence.
If the solver finds a model for the constraint, JDART uses this model for driving
concolic execution. If no model is found in a fixed number of attempts,
the SMT solver is called without added bounds. The number of attempts is a
configuration parameter of JDART and was fixed to 7 for SV-COMP 2020.
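
The bounded solving strategy can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the JCONSTRAINTS implementation: the path constraint is a plain Python predicate over one integer, a brute-force search stands in for the SMT solver, and the final "unbounded" call is approximated by a large finite range (fallback_range, a hypothetical parameter).

```python
from typing import Callable, Iterator, Optional, Tuple


def fibonacci_bounds() -> Iterator[int]:
    """Yield the bounds 1, 2, 3, 5, 8, 13, 21, ... named in the text."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b


def brute_force_solve(constraint: Callable[[int], bool],
                      lo: int, hi: int) -> Optional[int]:
    """Stand-in for one SMT call: search lo..hi for a satisfying value."""
    for x in range(lo, hi + 1):
        if constraint(x):
            return x
    return None


def solve_with_weakening_bounds(
    constraint: Callable[[int], bool],
    attempts: int = 7,            # value reported for SV-COMP 2020
    fallback_range: int = 10_000  # assumption: finite proxy for "no bounds"
) -> Tuple[Optional[int], Optional[int]]:
    """Return (model, bound used); the bound is None if the fallback was needed."""
    bounds = fibonacci_bounds()
    for _ in range(attempts):
        z = next(bounds)
        # Conjoin (-z <= x) and (x <= z) with the path constraint.
        model = brute_force_solve(constraint, -z, z)
        if model is not None:
            return model, z
    return brute_force_solve(constraint, -fallback_range, fallback_range), None
```

With seven attempts, the largest added bound is 21; a constraint such as x > 100 therefore falls through to the final call without added bounds.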

Analysis of JDART can be bounded by termination strategies. When checking
assertions, the termination strategy is to stop on the first occurrence of an
assertion violation. Additional strategies could be bounding the depth of the symbolic
analysis, bounding runtime, or terminating on specific errors. We refer the reader
to [7] for a more detailed and complete discussion of the features of JDART.


3 Strengths and Weaknesses 


JDART scored 524 points (max. of 602) in the JAVA track and took
third place for JAVA, behind JBMC (527 points) [2] and JAVA RANGER (549
points) [9]. All other tools scored considerably fewer points than JDART (next
best is COASTAL [10] with 472). Like JAVA RANGER and JBMC, JDART did
not report a single incorrect verdict. JDART exhibits the general strengths and
weaknesses of dynamic and symbolic analysis approaches for JAVA programs:


Runtime. Driven by concrete execution, the analysis is fairly fast. JDART is
overall the second fastest tool in cases where it can provide an answer. On the
other hand, because it does not use bounds, JDART has a relatively high number of
timeouts and runs that terminate due to resource limitations, and thus
only the fourth lowest cumulative runtime.

Symbolic Strings. Particular to JAVA verification is the challenge of provid- 
ing models for the behavior of classes in the JAVA standard library. In 
SV-COMP 2020 such models are mostly required for analyzing benchmarks 
that extensively incorporate String processing. We made a substantial
contribution to the code base of JDART and implemented models for
java.lang.String and related classes. As a consequence, JDART can analyze all but
one of the corresponding benchmark examples (JDART currently cannot analyze
regular expressions symbolically).

Unbounded Behavior. Based on principles of symbolic execution, JDART
does not terminate on unbounded loops or in case of unbounded recursion, 
leading to a number of timeouts on the corresponding set of benchmarks. 


4 Tool Setup 


The source code of JDART used for the competition artifact [8] is available 
on GitHub¹. JDART is designed as a plug-in to JPF and relies on ant as a
build system. One of its dependencies is the jpf-core project [11]. The other 
dependency is the JCONSTRAINTS library, which was configured to use Z3 [3] 
with incremental solving as a constraint solver for SV-COMP 2020. 

For the competition, JDART is wrapped by the run-jdart.sh shell script 
which generates .jpf configuration files, specifying which benchmark to analyze 
and the global configuration options to JDART: For SV-COMP 2020 all termi- 
nation criteria except for assertion violations are disabled, executing JDART as 
an almost unbounded assertion checker (the only bound in place is an upper 
bound of 127 on maximal length of String variables). The shell script records 
and interprets the output of JDART and can also report the version of JDART. 


5 Software Project 


The version of JDART that was used in SV-COMP 2020 is maintained by the 
Automated Quality Assurance Group at Technical University of Dortmund (in 
particular by the authors of this paper) and is available under the Apache
License, version 2.0, on GitHub¹. An initial version of JDART was developed by the
authors of [7] at NASA Ames Research Center and Carnegie Mellon University.
The original version of JDART is available on GitHub².


Acknowledgments. We are grateful for the work on JDART and JCONSTRAINTS 
by the respective original authors. Our success would not have been possible 
without their contributions. 


¹ https://github.com/tudo-aqua/jdart,
commit c7e30a29b98a69df2c7c96ae39b90ba0fe00e204
² https://github.com/psycopaths/jdart


References


1. Beyer, D.: Advances in automatic software verification: SV-COMP 2020. In:
Proc. TACAS (2). LNCS 12079, Springer (2020), https://www.sosy-lab.org/
research/pub/2020-TACAS.Advances_in_Automatic_Software_Verification_
SV-COMP_2020.pdf

2. Cordeiro, L., Kroening, D., Schrammel, P.: JBMC: Bounded model checking for Java
bytecode. In: Beyer, D., Huisman, M., Kordon, F., Steffen, B. (eds.) Tools and
Algorithms for the Construction and Analysis of Systems. pp. 219-223. Springer
International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-17502-3_17

3. De Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: International conference
on Tools and Algorithms for the Construction and Analysis of Systems. pp. 337-340.
Springer (2008). https://doi.org/10.1007/978-3-540-78800-3_24

4. Godefroid, P., Klarlund, N., Sen, K.: DART: Directed automated random testing.
In: Proceedings of the 2005 ACM SIGPLAN Conference on Programming
Language Design and Implementation. pp. 213-223. PLDI ’05, ACM (2005).
https://doi.org/10.1007/978-3-642-19237-1_4

5. Howar, F., Jabbour, F., Mues, M.: JConstraints: A library for working with
logic expressions in Java. In: Models, Mindsets, Meta: The What, the How, and
the Why Not?, pp. 310-325. Springer (2019).
https://doi.org/10.1007/978-3-030-22348-9_19

6. King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385-394
(1976). https://doi.org/10.1145/360248.360252

7. Luckow, K.S., Dimjasevic, M., Giannakopoulou, D., Howar, F., Isberner, M.,
Kahsai, T., Rakamaric, Z., Raman, V.: JDart: A dynamic symbolic analysis framework.
In: Proceedings of TACAS 2016. pp. 442-459 (2016).
https://doi.org/10.1007/978-3-662-49674-9_26

8. Mues, M., Howar, F.: JDart artifact used in SV-COMP 2020. Zenodo (2020).
https://doi.org/10.5281/zenodo.3678593

9. Sharma, V., Hussein, S., Whalen, M., McCamant, S., Visser, W.: Java Ranger
at SV-COMP 2020 (competition contribution). In: Biere, A., Parker, D.
(eds.) TACAS 2020. LNCS, vol. 12079, pp. 393-397. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-45237-7_27

10. Visser, W., Geldenhuys, J.: COASTAL: Combining concolic and fuzzing
for Java (competition contribution). In: Biere, A., Parker, D. (eds.)
TACAS 2020. LNCS, vol. 12079, pp. 373-377. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-45237-7_23

11. Visser, W., Havelund, K., Brat, G., Park, S., Lerda, F.: Model checking
programs. Automated Software Engineering 10(2), 203-232 (Apr 2003).
https://doi.org/10.1023/A:1022920129859


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium
or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were
made.

The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter's Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.




Map2Check: Using Symbolic Execution and Fuzzing 


(Competition Contribution) 


Herbert Rocha¹, Rafael Menezes¹, Lucas C. Cordeiro², and Raimundo Barreto³


¹ Department of Computer Science, Federal University of Roraima, Roraima, Brazil
herbert.rocha@ufrr.br
² Department of Computer Science, University of Manchester, Manchester, United Kingdom
³ Institute of Computing, Federal University of Amazonas, Amazonas, Brazil


Abstract. Map2Check is a software verification tool that combines fuzzing, sym- 
bolic execution, and inductive invariants. It automatically checks safety proper- 
ties in C programs by adopting source code instrumentation to monitor data (e.g., 
memory pointers) from the program's executions using LLVM compiler infras- 
tructure. For SV-COMP 2020, we extended Map2Check to exploit an iterative 
deepening approach using LibFuzzer and Klee to check for safety properties. We 
also use Crab-LLVM to infer program invariants based on reachability analysis. 
Experimental results show that Map2Check can handle a wide variety of safety 
properties in several intricate verification tasks from SV-COMP 2020. 


1 Overview 


Fuzzing involves providing random data as input to a program and then checking for
crashes. By contrast, path-based symbolic execution is an entirely static method that
symbolically explores the program state-space [1]. Due to a focus on single runs, fuzzing
techniques scale up relatively well. Path-based symbolic execution gives more
confidence in the verification results, but it suffers from the path-explosion problem, thus
limiting scalability. Here we exploit an iterative approach using fuzzing and symbolic
execution to implement a tool named Map2Check v7.3.1. Our main original
contributions include: (i) using LibFuzzer [7] to provide random data as input to C programs to
quickly expose "shallow" bugs, i.e., those that do not require complex data input; (ii)
implementing a new runtime library and instrumentation approach to monitor for crashes,
failing built-in assertions, and pointer safety; (iii) adopting Crab-LLVM [11] to infer
invariants; (iv) exploiting a sequential approach with LibFuzzer and KLEE [3] to check safety
properties in a novel way; and (v) adopting MetaSMT as a wrapper around various SMT
solvers, e.g., Boolector [2] and Yices [4], previously not supported by our tool. The
SV-COMP'20 results show that Map2Check can be useful in both falsifying and proving
reachability error and pointer safety-related properties.


2 Verification Approach


Map2Check uses compiler techniques to analyze C programs using the LLVM compiler
infrastructure, thereby tracking pointer addresses and variable assignments in the LLVM
bitcode [8]. In order to hold all values used in the analysis, a container API is employed
in Map2Check. The tool also generates built-in assertions and checks them by adopting an
approach with fuzzing (to falsify properties) and symbolic execution (to prove
correctness). Fig. 1 illustrates the Map2Check flow, which has the following main steps:
(i) convert the C code into the LLVM IR using Clang [5]; (ii) simplify the code via
constant propagation and dead code elimination after the code instrumentation; (iii)
apply further Clang optimizations (e.g., canonicalize natural loops and promote
memory to register); (iv) add Map2Check library functions to check the analyzed LLVM
bitcode; (v) generate inputs for Map2Check instrumented functions by executing
LibFuzzer and then KLEE with Crab-LLVM; and (vi) generate the witness file by
identifying each basic block executed in the control-flow graph of the LLVM IR.


[Figure 1 not reproduced. Recoverable labels: example C program, Clang, LLVM IR, Code Optimization, LLVM passes, LibFuzzer, Start Verification, Crab-LLVM, Inductive Invariants, klee_assume (CRAB), KLEE, Verify Result.]


Fig. 1. Map2Check Verification Flow.


In order to explore the program states and to generate inputs for the Map2Check
instrumented functions, the LibFuzzer implementation works by creating a custom entry
point, which contains an array of bytes (of type uint8_t). Thus, our implementation
consists of generating concrete values for the non-deterministic inputs that are our fuzz
targets. Additionally, we run multiple LibFuzzer processes in parallel, where N fuzzing jobs
run to completion, i.e., until a bug is found or time/iteration limits are reached.
Our fuzzing is coverage-guided (e.g., Clang coverage), i.e., it tries to maximize the code
coverage of a program. In our case, we adopted the inline-8bit-counters option
from LLVM (SanitizerCoverage) for built-in code coverage instrumentation, where the
compiler inserts an inline counter that is incremented on every edge.
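
The idea of turning a fuzzer-provided byte array into concrete values for non-deterministic inputs can be sketched in Python. This is an illustration only and not Map2Check's C runtime: the class FuzzInput, the method nondet_int, the little-endian layout, and the zero-padding of exhausted input are all assumptions made for the example.

```python
import struct


class FuzzInput:
    """Hands out values for nondeterministic calls from a fuzzer byte array."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.offset = 0

    def nondet_int(self) -> int:
        """Consume 4 bytes as a signed 32-bit little-endian integer.

        Assumption: when the input runs short, missing bytes are read as zero.
        """
        chunk = self.data[self.offset:self.offset + 4].ljust(4, b"\x00")
        self.offset += 4
        return struct.unpack("<i", chunk)[0]
```

Each call to nondet_int plays the role of one non-deterministic input of the analyzed program, so a single byte array determines one concrete execution.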

The KLEE implementation works by creating a variable of the used data type,
making it symbolic, and then returning its value. As a result, KLEE produces concrete
inputs for different program executions. We extend our KLEE implementation by adopting
MetaSMT [6], which is an embedded domain-specific language for SMT solvers.
The API provided by MetaSMT is translated at compile-time, through template
metaprogramming, into the native APIs provided by the SMT solvers [9]. Therefore, the
overhead introduced by MetaSMT is small.

In order to improve the KLEE core solver execution, the KLEE tool is run with
a counterexample caching solver, which can be used to avoid calling the underlying
solver in certain situations, and with MetaSMT, which is employed to construct expressions
that will be cached for each constraint to facilitate expression reuse. Note that symbolic
execution often requires concrete solutions for satisfiable queries (e.g., before calling an
external function, all symbolic bytes need to be replaced by concrete values), and it
benefits from simplifying constraints and reusing query results [9]. Therefore, the KLEE
cache solver is an important optimization, mainly because of the counterexample cache, which
is based on the observation that many constraint sets are in a subset/superset relation.
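
The subset/superset observation behind a counterexample cache can be illustrated with a small Python sketch. It is not KLEE's data structure: constraints are represented as opaque strings, the class and method names are hypothetical, and lookups scan linearly for simplicity.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple


class CounterexampleCache:
    """Toy cache keyed by constraint sets (constraints are opaque strings)."""

    def __init__(self) -> None:
        self.sat: Dict[FrozenSet[str], dict] = {}  # constraint set -> model
        self.unsat: Set[FrozenSet[str]] = set()    # unsatisfiable sets

    def insert(self, constraints: Iterable[str], model: Optional[dict]) -> None:
        """Record a solver result; model=None means unsatisfiable."""
        key = frozenset(constraints)
        if model is None:
            self.unsat.add(key)
        else:
            self.sat[key] = model

    def lookup(self, constraints: Iterable[str]) -> Tuple[str, Optional[dict]]:
        """Return ("unsat", None), ("sat", model), or ("miss", None)."""
        query = frozenset(constraints)
        # A query that contains an unsatisfiable subset is unsatisfiable.
        if any(core <= query for core in self.unsat):
            return "unsat", None
        # A model of a superset satisfies every constraint of the query.
        for cached, model in self.sat.items():
            if query <= cached:
                return "sat", model
        return "miss", None  # the underlying solver must be called
```

Only a "miss" requires a call to the underlying solver, which is where the saving comes from.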

To check the unreachability of an error location, we reduced the number of states
of the analyzed program to be explored by supplying invariants to the back-end
solvers. We adopted Crab-LLVM [11] to infer inductive invariants as constraints on the
error location. The invariants are automatically introduced into the program
as assumptions (before verification), and then KLEE receives the code as input.
Crab-LLVM is a static analyzer that employs an abstract interpretation engine over LLVM
bitcode based on the Crab library, which uses abstract domains such as intervals,
octagons, and polyhedra. Crab is built on top of IKOS (Inference Kernel for Open
Static Analyzers) to support a collection of abstract domains and fixpoint iterators.


3 Software Architecture 


Map2Check v7.3.1 is implemented as a source-to-source transformation tool in C/C++
using LLVM (v6.0). Map2Check uses Clang (v6.0) as a front-end to parse a C program
and to generate the respective LLVM bitcode to be used in the code transformation
to track pointers and variable assignments. It uses LibFuzzer [7] (v6.0) and KLEE [3]
(v2.0, as a symbolic execution engine) to automatically produce inputs to execute different
program paths. MetaSMT (v4.rc2) is the API to the reasoning engines. For SV-COMP’20, we
adopt Yices (v2.5.1), which is used by KLEE to check constraints over bit-vectors and
arrays and which substantially improved our results. Crab-LLVM [11] is used in reachability
mode to infer inductive invariants for LLVM bitcode.


4 Strengths and Weaknesses of the Approach 


Map2Check analyzed intricate verification tasks. The tool achieved 2nd place in the
ReachSafety-Arrays subcategory; in the ReachSafety-BitVectors category, Map2Check
achieved a score of 46, thereby presenting better results than Pinaka, UKojak, VeriFuzz,
and DIVINE. In other subcategories, our tool generated correct-unconfirmed and
incorrect true results. These results are, in part, explained by bugs in Map2Check's
witness generation and by limitations in handling Crab-LLVM invariants that stem from
over-approximations. We are investigating how to extend our tool by combining the data
from fuzzing with KLEE as program assumptions using template invariants.

In the MemSafety category, Map2Check achieved a score of -68. However, our
tool achieved notable results in comparison with state-of-the-art tools, e.g., in the
MemSafety-heap subcategory it achieved a score of 174, which outperforms UAutomizer,
ESBMC, DIVINE, and CBMC. Most incorrect results are, in part, explained by bugs
in the pointer tracking of our memory model, which could be improved by a trace
semantics with program optimizations as relations on sets of traces. Sadly, in the
NoOverflows category, the score was -89. The incorrect results are, in part, explained
by bugs in the overflow analyzer. One way to improve this result is to combine
the CPU flag postcondition test (LLVM supports several intrinsic functions, e.g., an add
operation that returns a structure with the result and an overflow flag) with Sanitizers checking.
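
An add operation that returns both the result and an overflow flag can be modelled as follows. The Python sketch below mimics the behavior of a 32-bit signed checked addition in the spirit of LLVM's llvm.sadd.with.overflow intrinsics; the function name and the fixed 32-bit width are choices made for this illustration, not Map2Check code.

```python
from typing import Tuple

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def sadd_with_overflow(a: int, b: int) -> Tuple[int, bool]:
    """Model a 32-bit signed add returning (wrapped result, overflow flag)."""
    exact = a + b
    # Wrap into the two's-complement range [-2**31, 2**31 - 1].
    wrapped = (exact + 2**31) % 2**32 - 2**31
    # Overflow occurred exactly when wrapping changed the value.
    return wrapped, exact != wrapped
```

An overflow checker would flag the operation whenever the second component is true, instead of inspecting the wrapped result afterwards.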


5 Tool Setup and Configuration 


In order to run our map2check-wrapper.py script [10], one must set the property file
(-p) and the verification task; it provides as result: TRUE + Witness, FALSE + Witness,
or UNKNOWN. For each error-path or correctness witness, a file (called
witness.graphml) with the witness proof is generated in the Map2Check root-path folder. The
dependencies, e.g., Clang and Yices tools, are included in the Map2Check distribution.
The BenchExec tool info module is named map2check.py and Map2Check participates
in SV-COMP'20 (as in the map2check.xml benchmark definition) in the following
categories: ReachSafety-Arrays, ReachSafety-BitVectors, ReachSafety-ControlFlow,
ReachSafety-Heap, ReachSafety-Loops, ReachSafety-Recursive, MemSafety, and NoOverflows.


6 Software Project 


Map2Check v7.3.1 is open source software distributed under the GPL license. We
provide instructions for building Map2Check from the source in the file README
(including the description of all dependencies). Map2Check is a joint project with the
Federal University of Roraima and the Federal University of Amazonas in Brazil.


Abstract. This paper concentrates on improvements of the PredatorHP shape
analyzer in the past two years, including, e.g., improved handling of interval-sized
memory regions or new support of memory reallocation. The paper
characterizes PredatorHP’s participation in SV-COMP 2020, pointing out its strengths and
weaknesses and the way they were influenced by the latest changes in the tool.


1 Verification Approach and Software Architecture 


We first briefly recap the main ideas behind PredatorHP and then discuss significant 
improvements that have been done in the tool in the past two years. 


1.1 The Predator Shape Analyzer


Predator is implemented using C++ and the Boost libraries as a GCC plug-in on top of 
the Code Listener framework [2], which we recently upgraded to work with GCC 7.4.0. 
Moreover, as shown below, we extended Code Listener by adding a type analysis phase 
before the compiled code is passed to the shape analysis implemented in Predator. In 
case a memory safety property is to be checked and there are no complex types, such as 
structures, unions, arrays, strings, or pointers in the program under analysis (including 
possibly unreachable code), we directly assume the program to be memory safe. 


[Figure not reproduced. Recoverable labels: source files (*.c, *.h), compiler options, config.h (re-build), analysis options, front end, Code Listener IR, analyzers, VarKiller, iterators, loc info, predator, errors with location info, stderr, witness.xml.]


The main aim of Predator is shape analysis of sequential C programs that use
low-level C pointer statements to implement various kinds of lists (singly- or doubly-linked,
possibly circular, nested, and/or shared). Predator looks for various memory-related
errors (invalid pointer dereferences, double free operations, memory leaks, etc.), and it
also checks validity of assertions present in the code. Predator uses abstract
interpretation based on the domain of symbolic memory graphs (SMGs) [1]. Predator abstracts
uninterrupted sequences of singly- or doubly-linked memory regions into appropriate
kinds of list segments. Further, Predator abstracts numerical values (either values stored
in memory regions, sizes of the regions, or offsets of pointers) using intervals with
constant bounds. The constants used as the bounds have a pre-defined maximum/minimum
value defined in the configuration of Predator (+32/-32 for SV-COMP’20). If the
maximum/minimum value is exceeded, the bound is set to plus or minus infinity. Predator
uses summaries to speed up analysis of programs structured into functions. Recursive
programs are, however, analysed up to a given call depth only.
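
The interval abstraction with limited constant bounds can be sketched as follows. This Python fragment shows one possible reading of the description above, not Predator's C++ code: an upper bound above the limit becomes plus infinity, a lower bound below the negated limit becomes minus infinity, and the function names are hypothetical.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]

BOUND_LIMIT = 32  # value the text reports for SV-COMP'20


def abstract_interval(lo: float, hi: float, limit: int = BOUND_LIMIT) -> Interval:
    """Replace bounds that exceed the configured limit by infinities."""
    new_lo = lo if lo >= -limit else -math.inf
    new_hi = hi if hi <= limit else math.inf
    return new_lo, new_hi


def join(a: Interval, b: Interval, limit: int = BOUND_LIMIT) -> Interval:
    """Smallest interval covering both inputs, abstracted as above."""
    return abstract_interval(min(a[0], b[0]), max(a[1], b[1]), limit)
```

Because only finitely many constant bounds remain possible, repeated joins of this kind cannot grow an interval indefinitely.
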
PredatorHP, i.e., the Predator Hunting Party [3,4], whose flow of control is shown
in the figure below, is implemented as a Python script and used to increase the
efficiency and precision of the analysis. Namely,
PredatorHP runs the base Predator verifier in parallel with several Predator hunters that
do not use the list-segment abstraction, do not join semantically different SMGs, and do not use
function summaries with matching of call parameters based on SMG entailment. While
the Predator verifier can claim a program correct, it does not report errors, in order to avoid
false alarms caused by abstraction. Predator hunters are classified as breadth-first (BFS) and
depth-first (DFS). The DFS hunters have a limit on the search depth defined as a certain
number of GCC's GIMPLE instructions. The hunters can normally only report errors.
The only exception is when the verified program has a finite state space that is fully
explored by the BFS hunter in the given time limit.

[Figure not reproduced. Inputs: source file, property file. Parallel components: Predator verifier (scheduler: DFS, heap abstraction, join, call cache); BFS hunter (scheduler: BFS, no VarKiller); two DFS hunters (scheduler: DFS, depth 900 and depth 1900, sampled intervals). Outputs: safe + witness.xml, error + witness.xml.]
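
The division of labour between the verifier and the hunters can be summarized by a small decision function. The Python sketch below is a simplification under stated assumptions: it combines final results only and ignores the parallel scheduling and time limits of the real PredatorHP script, and the function and parameter names are hypothetical.

```python
from typing import List


def predator_hp_verdict(verifier_claims_safe: bool,
                        hunters_found_error: List[bool],
                        bfs_explored_all_states: bool) -> str:
    """Combine results: hunters report errors, the verifier reports safety."""
    if any(hunters_found_error):
        # Hunters use no abstraction, so their errors are not false alarms.
        return "error"
    if verifier_claims_safe or bfs_explored_all_states:
        # Safety comes from the verifier, or from a BFS hunter that
        # exhausted a finite state space.
        return "safe"
    return "unknown"
```

For example, a run in which no hunter finds an error and the verifier cannot prove safety yields "unknown" unless the BFS hunter explored the whole state space.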

In SV-COMP'20, based on empirical data, the BFS hunter does not use
Predator's VarKiller, which removes dead variables from SMGs. This led to a significant
speedup on 5 verification tasks (and some slowdown on 3 tasks). Further, the
shallowest hunter, DFS 200, which searched up to a depth of 200 instructions and was used in
PredatorHP up to SV-COMP'19, was removed as it did not bring any advantage
over the DFS 900 hunter, and a DFS 1900 hunter was added to handle more complex
tasks (in particular, memsafety-ext2/split_list_test05-1 and
ntdrivers/floppy.i.cil-3). However, note that the DFS 900 hunter remains needed, as
otherwise 11 verification tasks would time out.


1.2 Recent Modifications of PredatorHP 


One of the main improvements of the latest version of Predator is that its SMG-based 
analysis has been extended to support memory reallocation on the heap. If a reallocation 
operation is executed on an SMG, two new SMGs are produced. The first one models 
the case when a new object of the required size is created, data from the old object are 
copied into the new object, and the old object is freed. In the second case, the existing 
object is resized. If the size decreases, Predator checks that no memory leak happens
because some pointer field is removed or invalidated (in case it is partially removed).

Another improvement concerns working with interval-sized memory regions, which
arise when allocating structures or arrays of parametric size. Although even older versions
of Predator were able to create such regions, the way in which they could be
treated in the subsequent analysis of the program was very limited. In particular, it was
impossible to dereference interval-sized regions, and hence Predator was very weak
when analysing programs with structures or arrays whose size is not fixed in advance. This
situation was first improved for SV-COMP’19 in the following pragmatic way.

Namely, whenever Predator hits a conditional statement that would previously yield
an interval value with fixed bounds (such as the statement if (n>=0 && n<10) for
a so-far unconstrained n), it will split the further analysis into as many branches as
there are values in the interval, each of them evaluating a concrete value from the
interval. After the split, no further interval-based allocations and dereferences, which
the previous version of Predator used to fail on, happen. In order for the splitting not
to cause a memory explosion, the latest version of Predator contains a parameter that
controls the maximum size of split intervals, which was set to 300 in SV-COMP’20.
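
The splitting step can be illustrated by the following Python sketch. It is not Predator's implementation: the function name is hypothetical, and returning None for intervals larger than the limit is an assumption, since the text does not say what happens in that case.

```python
from typing import List, Optional

MAX_SPLIT_SIZE = 300  # value the text reports for SV-COMP'20


def split_interval(lo: int, hi: int,
                   max_size: int = MAX_SPLIT_SIZE) -> Optional[List[int]]:
    """Return the concrete values to branch on, or None if there are too many."""
    size = hi - lo + 1
    if size > max_size:
        return None  # assumption: the interval is then left unsplit
    return list(range(lo, hi + 1))
```

For the condition if (n>=0 && n<10), the interval [0, 9] would be split into ten branches, one per concrete value of n.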

The above modification of Predator concerned dealing with memory regions whose 
size is given by an interval with finite bounds. In case one of the bounds is infinite, 
Predator has been extended to sample the interval and perform the further analysis with 
the sampled values. Currently, the sampling is done simply by taking some number of 
concrete values from the given interval starting/ending with the bound that is fixed (of 
course, for memory regions, only unboundedness from above makes sense). The 
number of considered samples is currently set to 3. Of course, this strategy cannot be 
used to soundly verify correctness of programs, and so it is used for detecting bugs only. 
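
A minimal Python sketch of such sampling follows. The function name sample_half_bounded is hypothetical, and taking consecutive values next to the finite bound is an assumption; the text only says that some number of values is taken starting or ending with the fixed bound, and that this number is currently 3.

```python
from typing import List, Optional

NUM_SAMPLES = 3  # number of samples reported in the text


def sample_half_bounded(lo: Optional[int], hi: Optional[int],
                        count: int = NUM_SAMPLES) -> List[int]:
    """Pick `count` values next to the finite bound of a half-bounded
    interval. None stands for an infinite bound."""
    if lo is not None and hi is None:
        return [lo + i for i in range(count)]   # interval [lo, +inf)
    if lo is None and hi is not None:
        return [hi - i for i in range(count)]   # interval (-inf, hi]
    raise ValueError("exactly one bound must be finite")


# A region of at least 8 bytes with no upper bound on its size.
sizes = sample_half_bounded(8, None)
```

Because only finitely many sizes are explored, an analysis built on this can report a bug found for one of the samples but cannot conclude that the program is correct.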

Although the above-mentioned treatment of intervals was primarily designed for dealing 
with interval-sized memory regions, it can help in other cases of dealing with integers 
too, namely when dealing with integer data as well as with interval-based pointer 
offsets. 

Next, we have implemented checking whether all dynamically allocated memory 
has been deallocated when a function with the noreturn attribute (such as abort or 
exit) is called. The implementation simply searches the SMG representing the memory 
at the moment of a call of a noreturn function and checks that it does not contain 
any valid dynamically allocated object. 
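
The check can be pictured with a toy model of the memory state. The sketch below is only an illustration: a flat list of objects replaces the SMG, and MemObject and leaked_at_noreturn are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MemObject:
    name: str
    on_heap: bool  # True for dynamically allocated objects
    valid: bool    # False once the object has been freed


def leaked_at_noreturn(objects: Iterable[MemObject]) -> List[str]:
    """Names of heap objects that are still valid when a noreturn
    function such as abort or exit is called."""
    return [o.name for o in objects if o.on_heap and o.valid]


state = [
    MemObject("buf", on_heap=True, valid=False),    # freed earlier
    MemObject("node", on_heap=True, valid=True),    # still allocated
    MemObject("local", on_heap=False, valid=True),  # stack variable
]
leaks = leaked_at_noreturn(state)
```

A non-empty result corresponds to the situation in which the check reports memory that was not deallocated before the call.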

We have also added support for the clobber instruction of GIMPLE, which terminates 
the lifetime of local variables of code blocks. Upon this instruction, Predator now 
marks the concerned memory region as deallocated, allowing it to detect invalid 
dereferences of objects local to a block from outside of the block. Further, we have added 
support for the modulo and bitwise-or instructions and created models of the standard 
library functions strcmp and realloc. This fixed several problems such as 
reporting false alarms when assigning fully-overlapping structures. 

Finally, we improved the generation of witnesses. Apart from some bug fixes, we 
changed the trace generation for the reachability category. Namely, in this category, 
if some trace ends with an error other than a call of __VERIFIER_error, the analysis 
recovers and continues to search for other traces. 


2 Strengths and Weaknesses 


The main strength of PredatorHP is that it treats code with various kinds of unbounded 
lists in a sound and efficient way. Predator hunters then allow it to quickly handle 
programs with a small finite state space (e.g., benchmarks from list-simple) and avoid 
many false alarms that could otherwise happen. Interestingly, among the 328 correct 
tasks in ReachSafety-Heap, MemSafety-Heap, and MemSafety-LinkedLists, only 98 use 
unbounded data structures, out of which the Predator verifier (and, of course, no hunter) 
handles 56 %. Next, out of the 328 tasks, 83 do not use linked data structures nor arrays, 
and 147 use them but are finite-state. The Predator verifier and the BFS hunter handle 
93 % of the 83 tasks that are so trivial that even the verifier does not use any abstraction. 
Out of the 147 tasks, 53 tasks are handled by both of them, while 2 tasks are handled 
solely by the verifier and 75 solely by the BFS hunter. 
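
The task counts above can be cross-checked with a few lines of Python. The variable names are illustrative; the derived share is plain arithmetic on the reported numbers and is not a figure stated by the authors.

```python
# Counts reported in the text.
total = 328
unbounded, no_linked, finite_state = 98, 83, 147

# The three groups partition the 328 correct tasks.
assert unbounded + no_linked + finite_state == total

# Finite-state tasks using linked structures or arrays, by who handles them.
both, verifier_only, bfs_only = 53, 2, 75
handled_finite = both + verifier_only + bfs_only
share = handled_finite / finite_state  # fraction of the 147 tasks handled
```

Under these numbers, handled_finite is 130, so 17 of the 147 finite-state tasks are handled by neither the verifier nor the BFS hunter.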

A weakness of Predator is that it specialises in dealing with lists, and so it handles 
structures such as trees, skip-lists, or arrays in a bounded way, i.e., for error detection, 
only. Another weakness of Predator has traditionally been its weak treatment of 
non-pointer data. We have tried to improve on the latter weakness by the described heuristics 
for dealing with intervals of integers with a specific aim to improve the way Predator 
handles memory regions of parametric size. The results of PredatorHP on SV-COMP'20 
benchmarks with arrays show that the heuristics did help. Indeed, the interval sampling 
heuristic allowed us to correctly detect 10 errors in tasks from array-memsafety, 
array-examples, and loops. Moreover, the interval-splitting heuristic also helped 
on some benchmarks for dealing with interval-based sizes, offsets, and/or integer data. 
Namely, it removed 8 unknown results in ReachSafety and 4 such results in MemSafety. 

The new type analysis looking for the presence of complex types allowed Predator to 
skip its main analysis loop in 77 tasks in the MemSafety category, of which 13 tasks 
(from termination-crafted) contain recursion, which Predator could not handle, 
and 6 tasks (from locks) would otherwise time out. Due to the new support of 
reallocation, Predator verifies all tasks containing a call of realloc. Due to the added 
support of clobber instructions, Predator detects invalid memory accesses in benchmarks 
accessing variables outside of the block in which they were created. All other 
improvements described above also helped in some cases and allowed PredatorHP to 
win the 1st place in the MemSafety category and in the ReachSafety-Heap sub-category. 


3 Contributors, Software Project, and the Tool Setup 


The main author of Predator is Kamil Dudka. Besides him and the PredatorHP team, 
Petr Müller, Michal Kotoun, and numerous other people listed in the docs/THANKS 
file in the distribution of Predator have contributed to the distribution of Predator. 
Predator is an open source software project distributed under GNU GPLv3. The 
source code used in SV-COMP'20 is available too¹. The README-SVCOMP-2020 file 
shipped with it describes how to build the tool. The script predatorHP.py serves to 
run the tool, taking a verification task file as a single positional argument. Paths to both 
the property file and the desired witness file are accepted via long options, i.e., 64-bit 
compiler options. The verification outcome is printed to the standard output. To run 
PredatorHP in the BenchExec environment, the predatorhp.py wrapper and the 
predatorhp.xml benchmark definition can be used. In SV-COMP'20, PredatorHP 
participated in the MemSafety category and in the ReachSafety-Heap sub-category. 


¹ http://www.fit.vutbr.cz/research/groups/verifit/tools/predator-hp 


Abstract. SYMBIOTIC 7 brings improvements in all parts of the tool. 
In particular, we integrated the advanced shape analysis implemented 
in Predator to our instrumentation process for memory safety checking. 
Further, we extended our slicer to correctly handle non-terminating programs. 
This new slicing is applied in termination analysis, where we also 
added instrumentation for detection of simple cycles in the program state 
space. The witness generation process changed as well. 


1 Verification Approach 


SYMBIOTIC 7 follows the same basic schema as all previous versions [4,5]: the 
program to be verified is first instrumented (if needed), then reduced by static 
program slicing, and finally symbolically executed using KLEE [2]. We describe 
the main modifications since SYMBIOTIC 5 (participating in SV-COMP 2018) 
as modifications in SYMBIOTIC 6 (competing in 2019) have not been published. 


Memory safety checking improvements. SYMBIOTIC uses a static pointer 
analysis to detect instructions that can potentially violate memory safety. To 
check these instructions, SYMBIOTIC 5 [5,3] instrumented the program with code 
that keeps records about allocated memory and uses the records to assert the 
validity of potentially misbehaving instructions. Then we sliced the program 
with respect to these assertions and called KLEE to check assertion validity. 

Since SYMBIOTIC 6, we slice the program directly with respect to the potentially 
misbehaving instructions without inserting any additional code. Then we 
call KLEE to check memory safety of the sliced program. 

SYMBIOTIC 7 newly integrates PREDATOR [6], a static analyzer specialized 
in memory safety. We first run PREDATOR in its over-approximating mode and 
in a configuration that analyses all branches in the given program and tries to 
recover from found errors. If PREDATOR says that the program is safe, we simply 
answer true. Otherwise, we take bug reports from PREDATOR and combine them 
with results of our static pointer analysis to get a more precise (i.e., smaller) set 
of potentially misbehaving instructions. Then we proceed like SYMBIOTIC 6. 


* M. Chalupa, T. Jašek, P. Ayaziová, and J. Strejček have been supported by the Czech 
Science Foundation grant GA18-02177S. M. Hruška, V. Soková, and T. Vojnar have 
been supported by the IT4Innovations Excellence in Science project (LQ1602) and 
the FIT BUT internal project FIT-S-20-6427. 

** Jury member and corresponding author: chalupa@fi.muni.cz. 


SYMBIOTIC 7 is also the first version that can distinguish between the 
valid-memcleanup and valid-memtrack properties. To do this, our clone of KLEE now 
reconstructs the shape of memory at the program exit if unfreed memory is 
found: KLEE starts with local and global variables and resolves pointers in these 
(if any). Then it resolves pointers in the pointed memory, etc. This way we can 
find out if the unfreed memory is reachable via a chain of dereferences or not. 
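
The reachability question can be sketched as a graph search. The Python below is an illustration only, not the KLEE-based implementation: objects are plain strings, points_to is a hypothetical map from an object to the objects its pointers refer to, and the roots stand for local and global variables.

```python
from typing import Dict, Iterable, Set, Tuple


def reachable(roots: Iterable[str],
              points_to: Dict[str, Set[str]]) -> Set[str]:
    """Objects reachable from the roots by following pointers transitively."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(points_to.get(obj, ()))
    return seen


def classify_unfreed(unfreed: Set[str], roots: Iterable[str],
                     points_to: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split unfreed heap objects into (still reachable, lost)."""
    live = reachable(roots, points_to)
    tracked = {o for o in unfreed if o in live}
    return tracked, unfreed - tracked


# Global g points to h1, which points to h2; h3 has no incoming pointer.
tracked, lost = classify_unfreed(
    {"h1", "h2", "h3"}, ["g"], {"g": {"h1"}, "h1": {"h2"}})
```

Here h1 and h2 are unfreed but still reachable via a chain of dereferences, while h3 is unfreed and unreachable, which is the distinction the two properties rely on.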


Termination analysis. SYMBIOTIC 6 introduced simple support for the termination 
property: a call to __VERIFIER_error is inserted before trivial infinite 
loops, e.g., while (true); loops. If the symbolic execution detects that such a 
call is reachable, SYMBIOTIC answers false as the program can reach an infinite 
loop. If all paths of the program are explored by symbolic execution without 
reaching any of these calls, all program executions are clearly terminating and 
we answer true (an infinite program path cannot be fully explored by symbolic 
execution). Note that program slicing was disabled for non-termination checking 
in SYMBIOTIC 6 as the slicer could remove infinite loops in some specific cases. 

SYMBIOTIC 7 brings two improvements. First, since we extended our slicer 
to correctly handle non-terminating programs [7], we now apply slicing with 
slicing criteria set to all exit points (including the instrumented error calls) of the 
program. Second, we instrument the program with checks for simple cycles in the 
state space. The instrumentation detects non-nested loops with a single entry 
for which it can conservatively determine a set $\{V_1, \ldots, V_p\}$ that includes all 
variables potentially modified by the loop. At the beginning of the loop body, we 
insert assignments that store the value of each variable $V_i$ into a new variable $V_i'$. 
At the end of the loop body, we insert the assertion 
assert($V_1 \neq V_1' \lor \cdots \lor V_p \neq V_p'$) to check a change in the vector of 
these variables. If this assertion is violated, 
the program has a non-terminating execution. 
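
The effect of this instrumentation can be imitated on a toy loop. The following Python sketch is not SYMBIOTIC's LLVM instrumentation: the loop is modelled by a condition and a body over a dictionary of variables, the names run_instrumented_loop and CycleDetected are hypothetical, and max_iters is an artificial bound added so that the sketch itself always stops.

```python
from typing import Callable, Dict

State = Dict[str, int]


class CycleDetected(Exception):
    """Raised when one loop iteration leaves every tracked variable unchanged."""


def run_instrumented_loop(state: State,
                          cond: Callable[[State], bool],
                          body: Callable[[State], None],
                          max_iters: int = 1000) -> int:
    """Execute `while cond: body` with the snapshot-and-compare check.
    Returns the number of completed iterations."""
    iters = 0
    while cond(state) and iters < max_iters:
        snapshot = dict(state)   # V_i' := V_i at the start of the body
        body(state)
        if state == snapshot:    # the inserted assertion is violated
            raise CycleDetected(f"state repeats: {state}")
        iters += 1
    return iters


# A terminating loop: i is incremented until it reaches 3.
done = run_instrumented_loop({"i": 0},
                             lambda s: s["i"] < 3,
                             lambda s: s.update(i=s["i"] + 1))
```

A body that changes nothing, for example lambda s: None, raises CycleDetected in the first iteration, which mirrors the violated assertion indicating a non-terminating execution.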


Error path replay. Although the slicer in SYMBIOTIC now provides algorithms 
that preserve non-termination properties of programs, outside the Termination 
category we still use the original non-termination insensitive slicing as it may 
remove more instructions. The price is, however, that SYMBIOTIC may report false 
alarms: an unreachable error location situated below an infinite loop may become 
reachable when the loop is sliced out. To fix this issue, we try to reproduce 
each error found by symbolic execution in the original (unsliced) program. If the 
error is reproduced, we report it as a real error. Otherwise, we say unknown. 


Improved witness generation. SYMBIOTIC 5 and 6 generated violation witnesses 
that describe only the initialization of non-deterministic variables at the 
beginning of the main function. SYMBIOTIC 7, on the other hand, generates violation 
witnesses that contain a complete test vector, i.e., the whole sequence 
of values returned from __VERIFIER_nondet_* functions during the error path 
replay. To get and correctly identify all these values, we have modified our fork 
of KLEE to support interpretation of __VERIFIER_nondet_* functions (and other 
undefined functions in general) internally. Currently, more than 99% of our violation 
witnesses (outside the Termination category) are confirmed. SYMBIOTIC 7 
still generates trivial correctness witnesses if no error is found. 


Other improvements. Other improvements in SYMBIOTIC 7 used in SV-COMP 2020 
include a faster data dependence analysis (a part of slicing) and 
better handling of assume statements in the slicer. SYMBIOTIC is now also able 
to continue the verification if the instrumentation or slicer crashes or exceeds the 
time limit. In such a case, KLEE is run on the original program, which has been 
only optimized by standard LLVM optimizations. For SV-COMP 2020, we set 
a time limit of 400 s on instrumentation and a time limit of 300 s on slicing. 


2 Software Architecture 


SYMBIOTIC 7 is built on top of LLVM 8.0.1 [8]. The tool consists of a set of 
modules written in C++ that process LLVM bitcode, and Python scripts that 
chain these modules according to the given configuration. 

For use in SYMBIOTIC, we have made several bugfixes in PREDATOR's LLVM 
backend and ported it to LLVM 8.0.1. Further, we have introduced a distinction 
between safe and possibly erroneous program instructions. 

SYMBIOTIC uses its own fork of KLEE that contains several modifications 
compared to the mainstream KLEE. In particular, the fork has been extended 
to handle symbolic-sized memory allocations, to process marks delimiting the 
lifetime of scoped variables, to check for memory leaks, and to generate violation 
witnesses in the SV-COMP format. 


3 Strengths and Weaknesses 


In SV-COMP 2020 [1], SYMBIOTIC 7 won the SoftwareSystems category and 
scored second in the MemSafety category and the FalsificationOverall meta category. 
Overall, SYMBIOTIC finished in fourth place. 

The main reason for winning SoftwareSystems is having only a few incorrect 
answers. Indeed, SYMBIOTIC did not win in the number of correct answers in any 
of the SoftwareSystems subcategories. However, we had only 4 incorrect answers 
and all of them in the subcategory DeviceDriversLinux64. This subcategory is 
huge and these incorrect answers have only a small impact on the weighted score. 

In MemSafety, we took second place after PREDATORHP, which executes 
several instances of the PREDATOR tool with different configurations in parallel. 
SYMBIOTIC calls just one of these instances as mentioned above. Additionally, 
PREDATORHP uses GCC, while we use PREDATOR running on LLVM, which is 
not as mature as the former. Also, we had a number of new unknown answers 
because KLEE does not support pointer comparisons, which we incorrectly did 
not detect in the previous versions of SYMBIOTIC. 

In general, SYMBIOTIC's results stem from the good performance of KLEE 
supported by efficient static analysis and slicing: the official results show that 
SYMBIOTIC can decide many benchmarks very quickly. 

The main weakness of our tool is the inherent complexity of symbolic execution 
and the limited possibility of analysing potentially unbounded loops or 
infinite paths with this technique. Indeed, as symbolic execution actually follows 
all paths in the program, it does not terminate if the program contains an 
unbounded loop or an infinite path (unless an error is found). Even when the 
number of paths is finite and all the paths are finite, symbolic execution usually 
runs out of resources if the number of paths is large. Although this problem 
is slightly alleviated by program slicing, our tool still does not scale well on 
complex programs. 


4 Tool Setup and Configuration 


— Download: From the competition archives or via 
  http://doi.org/10.5281/zenodo.3678328. 

— Installation: Unpack the archive. 

— Participation Statement: SYMBIOTIC 7 participates in all categories. 

— Execution: Run bin/symbiotic --sv-comp OPTS <source>, where available 
  OPTS include: 

    --prp=file, which sets the property specification file to use, 
    --witness-file, which sets the output file for the witness, 
    --32, which sets the 32-bit environment, 
    --help, which shows the full list of possible options. 
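
For scripted runs, the command line above can be assembled programmatically. The Python sketch below only builds the argument list and uses no options beyond those listed; the function name symbiotic_command and the file names in the example are hypothetical. The --witness-file option is left out because the text does not state how its argument is passed.

```python
from typing import List


def symbiotic_command(source: str, prp: str, bits32: bool = False) -> List[str]:
    """Build the argument vector for an SV-COMP style run of SYMBIOTIC."""
    cmd = ["bin/symbiotic", "--sv-comp", f"--prp={prp}"]
    if bits32:
        cmd.append("--32")  # select the 32-bit environment
    cmd.append(source)      # the program to verify comes last
    return cmd


# Hypothetical file names, for illustration only.
cmd = symbiotic_command("task.c", "unreach-call.prp", bits32=True)
```

The resulting list can be handed to subprocess.run from the directory into which the archive was unpacked.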


5 Software Project and Contributors 


SYMBIOTIC 6 and 7 have been developed by M. Chalupa, T. Jašek, M. Vitovská, 
M. Šimáček, L. Tomovič, and P. Ayaziová under the supervision of J. Strejček. 
Predator has been adjusted for the described integration by M. Hruška and 
V. Soková under the supervision of T. Vojnar. SYMBIOTIC and its components 
are available under the MIT license. The project is hosted by the Faculty of 
Informatics, Masaryk University. KLEE, LLVM, and PREDATOR are also available 
under open-source licenses. The source code of the project and references to all its 
components can be found at: 


https://github.com/staticafi/symbiotic 


Abstract. ULTIMATE TAIPAN is a software model checker that combines 
trace abstraction with abstract interpretation on path programs. In this 
year’s version, we replaced our abstract interpretation engine and now use 
a combination of multiple abstraction functions, fixpoint computation, 
algebraic program analysis, and SMT solving. Our new approach will allow 
us to integrate new techniques more easily. 


1 Verification Approach 


ULTIMATE TAIPAN is a software model checker which combines trace abstraction [8] 
and abstract interpretation [5]. The algorithm of TAIPAN follows the trace 
abstraction verification scheme for reachability where it constructs an abstraction 
of the program as a nested word automaton (NWA). This NWA has initially the 
same graph structure as the program's interprocedural control flow graph (ICFG): 
its states are program locations, its transitions are labeled with program statements, 
and states corresponding to error locations are accepting. Hence, the automaton 
recognizes a language where the symbols are statements and the words are sequences 
of statements (which we call traces) that lead to an error location. If the 
language of the abstraction automaton is empty, no error location can be reached 
and the program is safe. If there is a trace in the language, the algorithm needs to 
determine if it is a feasible trace, i.e., a trace that corresponds to an actual program 
execution, or not. A feasible trace constitutes an actual counterexample, and if one 
is found the algorithm terminates. If an infeasible trace is found, TAIPAN's algorithm 
differs from trace abstraction: it does not only analyze the actual trace, 
but rather constructs a path program¹ from this trace. It then tries to synthesize 
inductive invariants for the whole path program [7]. From these invariants, a new 
automaton is constructed whose language contains only infeasible traces. The 
new abstraction is then constructed as the difference of the old abstraction 
automaton and this new automaton. If the error location's 
invariant of the path program is not false, the computed invariants are 
too weak to prove infeasibility, and TAIPAN falls back to using interpolating SMT 
solvers to compute new invariants that are strong enough to discharge the trace. 


Daniel Dietsch — Jury Member 
¹ A path program is a projection of the program to the trace. 


TAIPAN's old algorithm used abstract interpretation to analyze path programs. 
In this year's iteration, we use a new approach, which is motivated by two drawbacks 
of our old algorithm. Firstly, extending an abstract interpretation engine 
with new abstract domains is labor-intensive and error-prone. Each abstract domain 
has an abstract post operator describing the effect program statements have 
on abstract states. For each abstract domain and each type of program statement 
the abstract post operator has to be defined and implemented, and re-use between 
domains is complicated. Furthermore, each abstract domain needs its own representation 
of an abstract state, so that exchanging information between multiple 
domains requires explicit conversions. Secondly, abstract interpretation always 
abstracts. Because each abstract domain has its own abstract state representation, 
it is usually not possible to implement a precise post operator. Hence, every 
application of post is an abstraction, which leads to unnecessary loss of precision. 


Fig. 1: Overview of the symbolic interpretation engine. (Diagram not reproduced; 
its recoverable labels are Work Item, Procedure p + Input φ, and Invariant φ for LOI.) 


Our new approach is inspired by Algebraic Program Analysis [9, 4] and the renewed 
interest in this technique (e.g. [6]), and by Logical Interpretation [10]. We use 
the modularity of algebraic program analysis to combine different techniques in a 
unifying framework, and from logical interpretation we take the idea of a shared 
representation of abstract program states as SMT formulas over which abstraction 
operators can compute fixpoints. 

An overview of our approach is depicted in Figure 1. The approach consists of 
two major components, the ICFG interpreter and the DAG interpreter. 

The ICFG interpreter component generates for a (partial) interprocedural control 
flow graph (ICFG) and a subset of its program locations (locations of interest, 
LOI) a set of path expressions represented as RegexDAGs. A RegexDAG is a directed 
acyclic graph with vertices that are labeled with regular expressions over the 
program's statements without calls and returns but with summary and enter statements. 
Each RegexDAG has exactly one sink node that represents a location of interest. 
We use summary statements when we call to and return from a procedure on 
a path to a LOI, and enter statements when we do not return until we reach the LOI. 


The DAG interpreter component then analyses a RegexDAG in topological 
order by applying different operators (Call Sum., Loop Sum., post op.) to the 
different vertex labels. All operators take a program state expressed as an SMT 
formula φ and a regular expression over program statements (i.e., a vertex label) 
and produce a new (possibly abstracted) program state that captures all the effects. 
If a vertex has multiple incoming edges, the different input states are simply 
joined with a logical disjunction (∨). Some of these operators depend again on 
the ICFG interpreter to compute their result. The most basic operator is the post 
operator (post op.), which computes the strongest post for star-free regular expressions 
and optionally applies an abstraction function to the result. The choice of 
abstraction function and whether to apply it is governed by different heuristics that 
can be changed. We call these heuristics fluids. The other operators are the call 
summarization (Call Sum.) and loop summarization (Loop Sum.) operators. The 
call summarization operator computes a summary for a procedure call, either 
with or without considering the context. The loop summarization operator computes 
a summary for the Kleene-star operator of regular expressions. Our current 
implementation does this by computing a fixpoint and resolving nested loops by 
recursively inserting summaries. The different operators (post, call summarization, 
loop summarization) are completely modular and can be considered black-boxes 
for the interplay between the two main components. When the DAG interpreter 
reaches the sink vertex of the RegexDAG, it returns the disjunction of this sink's 
input program states as invariant for this LOI. 
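
The traversal can be sketched in a few lines of Python. This is an explicit-state stand-in, not ULTIMATE's implementation: a program state is a finite set of integers instead of an SMT formula, so disjunction becomes set union, and every operator is an ordinary function on such sets. The names interpret_dag and loop_summary and the example DAG are hypothetical; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, FrozenSet, List

States = FrozenSet[int]                # stand-in for an SMT formula
Operator = Callable[[States], States]  # stand-in for post / summaries


def interpret_dag(preds: Dict[str, List[str]], label: Dict[str, Operator],
                  initial: States, sink: str) -> States:
    """Visit vertices in topological order. The input of a vertex is the
    union (disjunction) of its predecessors' outputs; the input of the
    sink is returned as the invariant."""
    out: Dict[str, States] = {}
    for v in TopologicalSorter(preds).static_order():
        ps = preds.get(v, [])
        inp = initial if not ps else frozenset().union(*(out[p] for p in ps))
        if v == sink:
            return inp
        out[v] = label[v](inp)
    raise ValueError("sink vertex not found")


def loop_summary(body: Operator) -> Operator:
    """Summarize `body*` by iterating to a fixpoint (finite state sets only)."""
    def summary(inp: States) -> States:
        cur = inp
        while True:
            nxt = cur | body(cur)
            if nxt == cur:
                return cur
            cur = nxt
    return summary


preds = {"a": [], "b": ["a"], "c": ["a"], "sink": ["b", "c"]}
label = {
    "a": lambda s: frozenset(x + 1 for x in s),                # x := x + 1
    "b": lambda s: frozenset(x for x in s if x % 2 == 0),      # assume x even
    "c": lambda s: frozenset(10 * x for x in s if x % 2 == 1), # assume x odd; x := 10 * x
}
invariant = interpret_dag(preds, label, frozenset({0, 1, 2}), "sink")
```

In this toy run the two branches b and c are joined at the sink, and loop_summary shows how a Kleene-star label could be handled by a fixpoint when the set of reachable states is finite.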


2 Strengths and Weaknesses 


Our new approach is easy to extend with new abstraction functions, fluids, and 
loop acceleration techniques. Compared to the previous approach we also gain 
much more precision by, e.g., having a reduced product between different kinds of 
abstraction without writing a transformation function: we can just use the logical 
disjunction. Using SMT formulas as representation of program states also allows 
us to reuse many of ULTIMATE's existing tools that deal with SMT, in particular 
simplification, quantifier elimination, rewriting, and debugging. 


Nevertheless, our current implementation is not as effective as the old one, 
because we did not finish porting the various abstract domains. We currently 
only support a basic interval abstraction and an explicit value abstraction, which 
severely limits the efficiency of our approach. We are also missing more intricate 
loop acceleration implementations and optimized fluid configurations, and our 
implementation does not yet support recursion. 


3 Architecture, Setup, Configuration, and Project 


ULTIMATE TAIPAN is a part of the open-source program analysis framework ULTIMATE, 
written in Java and licensed under LGPLv3. We used TAIPAN version 
0.1.25-f470102c in our competition submission, which is available as a .zip archive 
from multiple sources. Our submission requires Java 1.8 and Python 3.x. The 
submission contains an executable version of TAIPAN for Linux platforms, the 
binaries of the required SMT solvers Z3, CVC4, and MATHSAT, as well as a 
Python script, Ultimate.py, which maps the SV-COMP interface to ULTIMATE's 
command line interface. TAIPAN is invoked with 


./Ultimate.py --spec prop.prp --file input.c --architecture 32bit|64bit --full-output 


where prop.prp is the SV-COMP property file, input.c is the input C file, 
32bit or 64bit is the architecture, and --full-output enables verbose output 
to stdout. The output of TAIPAN is written to the file Ultimate.log. A violation [3] 
or correctness [2] witness may be written to the file witness.graphml. 
The benchmarking tool BENCHEXEC [1] supports TAIPAN through the tool-info 
module ultimatetaipan.py. TAIPAN participates in all categories, as declared 
in its SV-COMP benchmark definition file utaipan.xml. 


Abstract. We present AVR, a push-button model checker for verifying 
state transition systems directly at the source-code level. AVR uses information 
embedded in the word-level syntax of the design representation 
to automatically perform scalable model checking by combining a novel 
syntax-guided abstraction-refinement technique with a word-level implementation 
of the IC3 algorithm. AVR provides independently-verifiable 
certificates that offer provable assurance and are easy to relate to the 
word-level system. Moreover, proof certificates can be further used in 
innovative ways to extract key design information and are useful in a 
growing number of applications. 


1 Introduction 


Model checking [27,28] techniques based on incremental induction (like IC3 
[19,31]) have gained significant success [21] due to their property-directed nature 
and clever use of incremental SAT solving. Bit-level implementations of IC3, 
however, struggle with scalability due to being overwhelmed by low-level propositional 
learning [33]. Rapid advances in SMT solving [54,12] offer a solution and 
allow for performing IC3 directly at the word level by combining the incremental 
induction algorithm with an abstraction-refinement procedure [18,41,23,34]. 
AVR [2] is a model checker designed, primarily, for verifying safety properties 
of hardware. It uses syntax-guided abstraction [34], a generalization of implicit 
predicate abstraction [22], to perform IC3-style reachability on a first-order logic 
encoding of the transition relation resulting in word-level clause learning. Upon 
termination, AVR will either produce a proof certificate, in the form of a state 
formula representing an inductive invariant, if the safety property holds or a 
counterexample execution trace if it fails. In both cases, confidence in the verification 
output is achieved by using an external proof checker to independently 
confirm the correctness of the proof certificate or a trace simulator depicting the 
sequence of transitions leading to the failure. Beyond hardware, these features 
allow AVR to be used in innovative ways including the verification of distributed 
protocols defined over unbounded domains [44,45]. AVR also provides a variety of 
complementary verification techniques, such as data abstraction and interpolation, 
to increase its scalability, as well as useful utilities, such as design statistics 
and graphical visualizations, to provide high-level insights on the input design. 
AVR was independently evaluated to be the best word-level verifier in the single 
bit-vector track of Hardware Model Checking Competition (HWMCC) 2019 [17]. 


© The Author(s) 2020 


2 Motivation 


Consider a predicate p := (a + b < 1) defined over two 32-bit variables a and 
b. An equivalent propositional-level representation of p will involve a bit-blasted 
expression involving 64 Boolean variables and several hundred clauses. As a 
consequence, bit-level model checking algorithms do not scale as variable bit 
widths increase and suffer from the so-called state-space explosion problem [26]. 
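
The gap between the two representations can be made concrete with a small Python sketch that evaluates p once over machine words and once over individual Booleans. It is an illustration only and not part of AVR: the function names are hypothetical, and an unsigned interpretation with wrap-around addition is assumed, since the text does not fix the signedness of a and b.

```python
from typing import List


def bits(value: int, width: int) -> List[bool]:
    """Little-endian list of `width` Booleans encoding `value`."""
    return [bool((value >> i) & 1) for i in range(width)]


def add_bits(a: List[bool], b: List[bool]) -> List[bool]:
    """Ripple-carry adder over Boolean lists; the final carry is dropped."""
    out, carry = [], False
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x and y) or (carry and (x ^ y))
    return out


def p_bit_level(a: int, b: int, width: int) -> bool:
    """a + b < 1 over 2 * width Boolean inputs (unsigned: the sum is zero)."""
    return not any(add_bits(bits(a, width), bits(b, width)))


def p_word_level(a: int, b: int, width: int) -> bool:
    """The same predicate evaluated directly on machine words."""
    return (a + b) % (1 << width) < 1


# For width 32 the bit-level view has 64 Boolean inputs.
num_inputs = 2 * 32
```

The word-level version is a single comparison, while the bit-level version needs one adder stage per bit, which is the growth that a propositional encoding has to pay for as bit widths increase.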

AVR derives its motivation from the fact that the word-level representation 
of a problem contains useful high-level information that can be exploited for 
better scalability. Building on our previous work [33,34], AVR uses this insight 
to infer an implicit syntax-guided abstraction using terms built from objects 
present in the word-level syntactic description of the problem (like a, b, 1, +, 
<). The approach can be further combined with data abstraction using uninterpreted 
functions [20,11] to simplify reasoning for the underlying query solver. 
This, coupled with efficient SMT solving, allows for an effective word-level model 
checking algorithm that can scale better than bit-level engines for a variety of 
verification problems. Moreover, the underlying induction-based verification procedure 
has the unique strength of producing word-level proof certificates that 
are useful in a variety of applications [32,37,45,44]. 


3 System Architecture 


Fig. 1: Verification flow with AVR (block diagram not reproduced). The diagram shows 
the multi-engine wrapper, the frontends (Verilog, VMT, and BTOR2 via Btor2Tools), the 
AVR core, and the quantifier-free SMT solver backends Yices 2 (UF, BV, LIA, Arrays), 
Boolector, MathSAT 5 (UF, BV, LIA, Arrays, interpolation), and Z3 (UF, BV, LIA, 
future extensions). The AVR core box lists the pre-processor, IC3+SA, simple 
optimizations, property-directed splitting, incremental refinement, incremental caching, 
the extract/concat handler, data abstraction, interpolation, bounded model checking, 
the certificate printer, and the .dot visualizer. The outputs are the certificates 
proof.smt2 and cex.witness and the utility files .smt2, .stats, and .dot. 
UF: uninterpreted functions, BV: bit-vectors, LIA: linear integer arithmetic 


Fig. 1 shows the architecture and verification flow of AVR. 


Frontends in AVR extract the model checking problem from inputs in different 
formats using openly-available tools. 

— Verilog + SystemVerilog Assertions [9] (using Yosys [55]) 

— VMT [8] (using MathSAT 5 [24]) 

— BTOR2 [51] (using Btor2Tools [3]) 


AVR core performs IC3 with syntax-guided abstraction (IC3+SA) and implements 
several verification techniques and utilities (detailed in §3.1, §3.2). 


SMT solver backends use the latest versions of state-of-the-art SMT solvers 
(Yices 2 [30], Boolector [50], MathSAT 5 [24] and Z3 [48]) to efficiently integrate 
incremental solver reasoning with AVR core using a C++ interface. 


Multi-engine wrapper allows for process-level parallelism by running multiple 
instances of AVR in parallel using proof race (as elaborated later in §3.3). 


3.1 Techniques 


At its core, AVR implements a word-level IC3 procedure where terms in the 
implicit syntax of the problem are used as building blocks to perform IC3-style 
clause learning at the word level using SMT solving. The key differences be- 
tween IC3+SA [34], as implemented in AVR, and bit-level IC3 [19,31] can be 
summarized as follows: 
— IC3+SA uses relations defined over syntax terms (referred to as atoms) instead
of individual state bits to implicitly represent an abstract state space.
— SMT solving is used instead of propositional SAT solving for solver reasoning. 
— Counterexample-guided abstraction refinement [25] is used to automatically 
eliminate the spurious behavior in the syntactically abstracted domain by 
identifying new terms from the proof of unsatisfiability [42]. 


Within the core IC3+SA framework, AVR implements several optimizations and 
important features that are helpful in improving model checking performance. 


Core features 


— Pre-processor optimizations perform simple transformations to standardize 
and optimize the input model extracted from different input formats. 

— Incremental refinement performs abstract counterexample analysis in an in- 
cremental fashion by using single-step solver queries instead of conventional 
multi-step path queries. 

— Incremental caching allows caching frequently-used data structures to speed 
up incremental SMT solving (at the cost of increasing memory usage). 

— Multiple SMT backends allow configuring usage of different SMT solvers for 
different kinds of SMT queries based on the type of query. 


Add-on techniques 


— Property-directed splitting breaks wide words at bit-field extraction and con- 
catenation boundaries [10] in a property-directed manner. 

— Data abstraction focuses on the control structure of the problem by combining
IC3+SA with data abstraction, which converts data operations to
uninterpreted functions [20,11,41].

— Interpolation adds Craig interpolants [46] and incremental refinement to 
extract new terms from a spurious abstract counterexample. 

— Extract/Concat handler adds a novel dedicated engine to deal with light- 
weight interpretation of bit-field extraction and concatenation operations. 

— Bounded model checking (BMC) [15] allows for an alternative to the IC3+SA
engine for quick bug hunting, especially for shallow bugs.

— Other options include adding global assumptions lazily, minimizing proof 
certificates, making syntax-guided abstraction closer to (resp. farther from) 
implicit predicate abstraction by decreasing (resp. increasing) abstraction 
granularity, exploiting randomness during solving, and a few others. 
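
As an illustration of the bounded model checking add-on listed above, the following is a minimal explicit-state sketch in Python. It is not AVR's implementation: AVR works symbolically with SMT solvers, whereas this sketch enumerates states directly. The function name `bounded_model_check` and the toy counter are assumptions made for the example.

```python
from typing import Callable, Hashable, Iterable, List, Optional


def bounded_model_check(
    init: Iterable[Hashable],
    successors: Callable[[Hashable], Iterable[Hashable]],
    is_bad: Callable[[Hashable], bool],
    bound: int,
) -> Optional[List[Hashable]]:
    """Return a path from an initial state to a bad state that uses at most
    `bound` transitions, or None if no such path exists."""
    # Each frontier entry pairs a state with the path that reached it.
    frontier = [(s, [s]) for s in init]
    seen = set()
    for _ in range(bound + 1):  # depths 0, 1, ..., bound
        next_frontier = []
        for state, path in frontier:
            if is_bad(state):
                return path
            if state in seen:
                continue  # already expanded at the same or a smaller depth
            seen.add(state)
            for t in successors(state):
                next_frontier.append((t, path + [t]))
        frontier = next_frontier
    return None


# Toy system (hypothetical): a counter modulo 8 that must never reach 5.
def step(s):
    return [(s + 1) % 8]


shallow = bounded_model_check([0], step, lambda s: s == 5, bound=4)
deep = bounded_model_check([0], step, lambda s: s == 5, bound=5)
```

With bound 4 the search reports nothing, while bound 5 is just deep enough to expose the violation. This is why BMC suits shallow bugs but cannot by itself prove a property.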


Utilities 
AVR also provides a number of useful utilities to the user including: 
— Printing the problem in SMT-LIB format [13]. 


— Graphical visualizations of the problem and the word-level clause learning. 
— Detailed statistics report on the input design and the verification run. 


3.2 Certificates 


Once a model checking problem is solved, there can be two possible outcomes: 
either the property holds (safe), or it fails (unsafe). 

If the property holds, IC3+SA produces an inductive invariant, i.e. an ap- 
proximate fixpoint that establishes the property to be true in all executions of 
the system. Inductive invariants act as proof certificates that guarantee the cor- 
rectness of the verification outcome. AVR prints such proof certificates directly in 
the SMT-LIB format, which allows for independent checking of their correctness 
using an external SMT solver like Yices 2 or Z3. Since proof certificates are in 
the word-level format, they are human-readable and much easier to relate to the 
word-level input directly at the source-code level (as against bit-level invariants 
which are usually too hard to understand). Proof certificates have many use- 
ful applications, including the derivation of inductive validity cores [32], gaining 
deeper insights on design behavior, deriving assume-guarantee verification condi- 
tions [37,53], deriving helper assertions during multi-property verification [36,29], 
and generalizing to quantified domains (as elaborated later in §4.3). 
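
To make the role of a proof certificate concrete, the sketch below checks the three standard obligations of an inductive invariant (initiation, consecution, safety) by brute force over a small explicit state space. This is only an illustration under assumptions: AVR emits SMT-LIB certificates that are checked with an SMT solver, and the helper `check_inductive_invariant` and the toy counter are hypothetical.

```python
def check_inductive_invariant(states, init, trans, prop, inv):
    """Check that `inv` is an inductive invariant that implies `prop`.

    states: finite iterable of all states
    init(s), prop(s), inv(s): predicates on states
    trans(s, t): True if there is a transition from s to t
    """
    states = list(states)
    # Initiation: every initial state satisfies the invariant.
    initiation = all(inv(s) for s in states if init(s))
    # Consecution: the invariant is preserved by every transition.
    consecution = all(
        inv(t) for s in states if inv(s) for t in states if trans(s, t)
    )
    # Safety: the invariant implies the property.
    safety = all(prop(s) for s in states if inv(s))
    return initiation and consecution and safety


# Toy system (hypothetical): a 4-bit counter that adds 2 modulo 16.
STATES = range(16)


def init(s):
    return s == 0


def trans(s, t):
    return t == (s + 2) % 16


def prop(s):
    return s != 7  # the property to prove


def is_even(s):
    return s % 2 == 0  # candidate certificate


certificate_ok = check_inductive_invariant(STATES, init, trans, prop, is_even)
property_alone_ok = check_inductive_invariant(STATES, init, trans, prop, prop)
```

The property itself is not inductive here (state 5 satisfies it but steps to 7), whereas the stronger statement "the counter is even" passes all three checks and therefore certifies the property.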

When the property fails, AVR produces a counterexample trace that estab- 
lishes how to reach a bad state (a state where the property is false) starting from 
an initial state. AVR prints the counterexample witness in BTOR2 witness for- 
mat [51], which allows for independent verification of the execution trace using 
a BTOR2 witness simulator [4]. This allows the designer to debug and pin-point 
the source of error by analyzing the execution leading to the buggy state. 
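
The idea of independently replaying a counterexample can be sketched as follows. This is a simplified Python illustration over explicit states, not the BTOR2 witness format or the BTOR2 witness simulator; the function name `replay_trace` and the toy system are assumptions.

```python
def replay_trace(trace, init, trans, prop):
    """Return True if `trace` is a genuine counterexample: it starts in an
    initial state, follows only valid transitions, and ends in a bad state."""
    if not trace or not init(trace[0]):
        return False
    for s, t in zip(trace, trace[1:]):
        if not trans(s, t):
            return False  # the claimed step is not allowed by the model
    return not prop(trace[-1])


# Toy system (hypothetical): a counter modulo 16 that must never reach 3.
def init(s):
    return s == 0


def trans(s, t):
    return t == (s + 1) % 16


def prop(s):
    return s != 3


valid = replay_trace([0, 1, 2, 3], init, trans, prop)
skips_a_step = replay_trace([0, 2, 3], init, trans, prop)
ends_in_safe_state = replay_trace([0, 1, 2], init, trans, prop)
```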


3.3 Proof Race 


AVR supports a variety of configurations and add-on features (as discussed
in §3.1). Without detailed knowledge of the input, it is hard to tell upfront which
technique will perform the best. Different configurations are useful to tackle dif- 
ferent types of problems, though manually trying different configurations can 
become tedious for the user. To counter this, AVR offers a multi-engine wrap- 
per called proof race that automatically runs multiple instances of AVR with 
different configurations in parallel and offers process-level parallelism. Given a 
set of specified resource limits, proof race initiates multiple AVR instances and
terminates execution as soon as one of these instances successfully races to the 
result. Such a portfolio-based approach is crucial in practice for fast verification 
performance since no single technique performs best in all cases [21,16]. It is 
also further strengthened by complementing AVR’s word-level techniques with 
state-of-the-art model checking engines like ABC dprove [14], IC3ia [23] etc. 
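
The portfolio idea behind proof race can be sketched in a few lines of Python. This is a hedged illustration only: AVR uses process-level parallelism over real tool configurations, while the sketch uses threads and stand-in engine functions. The names `proof_race`, `fast_engine`, `slow_engine` and `inconclusive_engine` are hypothetical.

```python
import concurrent.futures as cf
import time


def proof_race(engines, timeout):
    """Run all engines concurrently and return (name, verdict) of the first
    engine that produces a conclusive verdict, or None on timeout.

    engines: dict mapping a name to a zero-argument callable that returns
             "safe", "unsafe", or None (inconclusive).
    """
    if not engines:
        return None
    pool = cf.ThreadPoolExecutor(max_workers=len(engines))
    futures = {pool.submit(run): name for name, run in engines.items()}
    winner = None
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            verdict = fut.result()
            if verdict is not None:
                winner = (futures[fut], verdict)
                break  # first conclusive result wins the race
    except cf.TimeoutError:
        pass  # resource limit reached without a verdict
    finally:
        # Threads cannot be killed; a process-based race would terminate
        # the losing instances here.
        pool.shutdown(wait=False)
    return winner


# Stand-in engines (hypothetical) with different running times.
def fast_engine():
    time.sleep(0.01)
    return "safe"


def slow_engine():
    time.sleep(0.3)
    return "safe"


def inconclusive_engine():
    return None


result = proof_race(
    {"slow": slow_engine, "fast": fast_engine, "unknown": inconclusive_engine},
    timeout=5,
)
```

The inconclusive engine finishes first but is ignored, so the race is decided by the fastest engine that actually reaches a verdict.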


4 Case Studies¹


4.1 Apache Buffer Overflow


We consider patched versions of two buffer overflow vulnerabilities [40] from 
standard modules of the Apache web server [1]. 

apache-escape-absolute patches a high-severity vulnerability, CVE-2006-3747
[7], an out-of-bounds buffer overflow that allows a remote attacker to cause a
denial of service and execute arbitrary code via crafted URLs.
The patched version corrects a check (c < TOKEN_SZ) to (c < TOKEN_SZ - 1).

apache-get-tag patches a medium-severity vulnerability, CVE-2004-0940 [6], a
buffer overflow that occurs when copying user-supplied tag strings into finite
buffers. A local attacker may leverage this issue to execute arbitrary code on
the affected computer with the privileges of the affected Apache server. The
patched version corrects a check that validates the length of the tag strings.

In less than a minute, AVR successfully verifies that both of these buffer 
overflow exploits are unreachable in the patched versions for any buffer size. 
AVR also provides human-readable proof certificates that are externally verified 
using Z3, and provides provable assurance against these security vulnerabilities. 


4.2 Public Key Authentication Protocol 


The Needham-Schroeder public key authentication protocol [49] allows establish- 
ing mutual authentication between an initiator A and a responder B, after which 
some session involving the exchange of messages between them can take place. 
Unfortunately, this protocol is vulnerable to a man-in-the-middle attack [43]. If 
an intruder I can persuade A to initiate a session with him, he can relay the
messages to B and convince B that he is communicating with A. 

We consider an instance of the protocol from HWMCC’19 [17,52] with 3 
initiators and responders each, and with an unsafe state defined as a responder
having finished authentication with the intruder as a party. Within a minute,
AVR finds an execution trace that establishes how to reach an unsafe state. The 
counterexample witness produced by AVR can be replayed using the BtorSIM 
simulator [4] to verify the execution trace and to debug the protocol. 


4.3 Verifying Distributed Protocols 


Beyond verifying model checking problems from finite domains, AVR has shown
preliminary application in the verification of distributed protocols, which are
generally expressed over unbounded domains (with an unbounded number of
clients, servers, epochs, messages, etc.). The I4 system [45,44] demonstrates how
AVR can be used to verify a simpler finite version of the protocol, followed by
generalizing AVR’s proof certificates to the unbounded domain. For example, a
finite-domain invariant saying “clients $C_1$ and $C_2$ cannot both link to the server
$S$”, i.e. $\neg(\mathit{link}(C_1, S) \wedge \mathit{link}(C_2, S))$, can be generalized to the unbounded
domain as “no two different clients can both link to a server”, i.e.

$$\forall C_1, C_2, S.\; (C_1 \neq C_2) \Rightarrow \neg(\mathit{link}(C_1, S) \wedge \mathit{link}(C_2, S)).$$


¹ All results presented in this paper can be replicated from [35,5].


5 Strengths 


Control-centric properties, where much of the complexity lies in the control logic 
(such as sequential equivalence checking, microprocessor instruction control unit, 
key-value store) are much easier to verify using AVR. Syntax-guided abstraction 
hides the domain complexity outside of the problem syntax, and automatically 
separates important control-flow details from the irrelevant data component. 
This, combined with data abstraction, allows for scalable model checking with 
the capacity to scale independently of the variable bit widths [33,34]. 


Push-button verification using AVR eliminates the need for tedious human inter- 
vention in verification (such as manual identification of abstraction predicates, 
manually adding helper assertions) by automatic incremental construction of 
abstraction and word-level clauses using the IC3+SA algorithm. 


Provable assurance on the verification outcome is guaranteed by AVR using 
independently-checkable proof certificates and counterexample traces. 


Useful utilities that AVR provides, such as support for multiple input formats, 
efficient integration with state-of-the-art SMT solvers, proof race, high-level sys- 
tem statistics, graphical visualizations, etc. contribute to a user-friendly experi- 
ence and ease of use. 


6 Limitations 


Heavy data dependency can make word-level techniques in AVR ineffective for 
certain problems, especially when a majority of bit-precise values in the data 
domain play an important role (for example, puzzle solving problems like Tower 
of Hanoi [39], Peg Solitaire [38], etc. formulated as reachability problems [52]). 
Logic synthesis and bit-level optimizations [14,47] can be very useful for such 
problems and help bit-level checkers perform better than word-level techniques 
by significantly decreasing the problem complexity at the bit level. 


First-order logic fragments beyond quantifier-free bit-vectors, arrays and unin- 
terpreted functions (such as non-linear arithmetic, floating-point numbers, quan- 
tifiers, etc.) and properties beyond safety (such as liveness and fairness) have 
limited support in the current tool implementation. AVR’s primary focus has 
been on verification of safety properties defined on hardware systems. 


7 Conclusions


AVR provides a variety of techniques to efficiently perform automatic word-level 
verification using SMT solvers with provable guarantees and security. AVR has 
been effective in hardware verification [17,33,34] and shows significant promise 
for the verification of distributed protocols [44,45]. In the future, we plan to 
address some of its current limitations and extend its application to practical 
verification problems beyond the hardware domain. 


Data Availability Statement and Acknowledgments. The software and datasets 


Abstract. Prior research has shown how to construct a mechanically
verified model checker for timed automata, a popular formalism for 
modeling real-time systems. 

In this paper, we shift the focus from verified model checking to certify- 
ing unreachability. This allows us to benefit from better approximation 
operations for symbolic states, and reduces execution time by exploring 
fewer states and by exploiting parallelism. Moreover, this gives us the 
ability to audit results of unverified model checkers that implement a 
range of further optimizations, including certificate compression. 

The resulting tool is evaluated on a set of standard benchmarks to 
demonstrate its practicality, using a new unverified model checker imple- 
mentation in Standard ML to construct the certificates. 


Keywords: Timed automata · Certification · Model Checking · Interactive
Theorem Proving · Isabelle/HOL


Timed automata [1] are a widely used formalism for modeling real-time systems, 
which is employed in a class of successful model checkers such as UPPAAL [4]. 
These tools can be understood as trust-multipliers: we trust their correctness to 
deduce trust in the safety of systems checked by these tools. As a consequence, 
one wants to ensure as rigorously as possible that the computation results of 
timed automata model checkers are correct. 

Previous work [31] has addressed this problem by constructing a model checker 
for timed automata that is fully verified using Isabelle/HOL [25]. This tool is 
intended to be a reference implementation that can be used to scrutinize the 
correctness of other model checkers. As such, it is mainly able to check small 
and medium-sized benchmark examples, but the performance gap w.r.t. more 
practical model checkers prevents it from checking realistic benchmark models 
within reasonable time and space bounds. 

We address this issue by shifting the focus from full verified model checking to 
only certifying that the result produced by an unverified model checker is correct. 
We only study reachability: it is the most important property that is checked 
with timed automata model checkers, and some model checkers only support 
reachability. It is crucial to ensure that a bad state is certainly not reachable if 
the model checker claims so; thus we want to certify unreachability. Certifying
that a state is indeed reachable would amount to extracting a timed trace and 
certifying that the trace is compatible with the model. While implementing this 
in a verified manner would be comparatively easy, we consider it less important 
because it corresponds to the bug finding functionality of model checkers, which 
carries less trust. 

The recipe for certifying unreachability is simple: the model checker explores a 
number of states until it determines that there are no more states to be found. If 
none of the states fulfills the final state predicate (i.e. violates the safety property),
then the model checker will answer “unreachable”. We use the set of explored 
states as the unreachability certificate. In essence, we only need to check that 
the initial state is contained in this set, that there are no outgoing edges from 
this set, and that none of the states in the set fulfill the final state predicate. 
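
The three checks above can be sketched for a plain finite graph as follows. This is a minimal Python illustration without symbolic states or subsumption; the function name `check_unreachability_certificate` and the example graph are assumptions.

```python
def check_unreachability_certificate(cert, initial, successors, is_final):
    """Return True if the state set `cert` proves that no final state is
    reachable from `initial`.

    cert: collection of states claimed to contain all reachable states
    successors(s): iterable of the direct successors of s
    is_final(s): True for the states that must be unreachable
    """
    cert = set(cert)
    # 1. The initial state is contained in the set.
    if initial not in cert:
        return False
    # 2. No state in the set fulfills the final state predicate.
    if any(is_final(s) for s in cert):
        return False
    # 3. There are no edges leaving the set.
    return all(t in cert for s in cert for t in successors(s))


# Example graph (hypothetical): a cycle 0 -> 1 -> 2 -> 0 and a separate
# edge 3 -> 4, where state 4 is the final (bad) state.
GRAPH = {0: [1], 1: [2], 2: [0], 3: [4], 4: []}


def successors(s):
    return GRAPH[s]


def is_final(s):
    return s == 4


accepted = check_unreachability_certificate({0, 1, 2}, 0, successors, is_final)
not_closed = check_unreachability_certificate({0, 1}, 0, successors, is_final)
has_final = check_unreachability_certificate(
    {0, 1, 2, 3, 4}, 0, successors, is_final
)
```

Each check only inspects one state and its direct successors, which is the locality that makes the check cheap to run.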

The switch to certification holds many advantages. Timed automata model 
checking uses over-approximations of symbolic states to ensure termination. A 
large variety of these approximation operators has been studied [2,3,14]. Our 
previous work [29] has shown that, while formally proving the correctness of these 
approximation operations is feasible in principle with an interactive theorem 
prover, the effort is rather high. Instead, to certify unreachability, it is sufficient 
to only know that the approximation operator indeed yields a state that is at 
least as big as the precise symbolic state. Certifying this property is cheap. 

Moreover, certification eases parallelization. Checking that a state is not final 
and that all its successors are covered by the state set are local properties. We 
show how to exploit this in a verified implementation, while only mildly increasing 
the verification effort and the size of the trusted code base. 

Finally, the number of states explored by a model checker can vary immensely, 
depending on a range of factors such as the chosen approximation operator or 
the search order. Thus, an efficient unverified tool can exploit different heuristics 
and strategies to compute a state space that is as small as possible, and thereby 
speed up the certification effort. In this context, we also study a number of
compression techniques to reduce the number of states in the certificate after the 
model checker has concluded its search. 

We use a new unverified model checker called Mlunta, which is implemented 
in Standard ML (SML), to generate certificates for a set of standard benchmarks, 
and to evaluate our verified certifier’s performance on these benchmarks¹.


Related Work This work is based on an existing Isabelle/HOL formalization of 
timed automata model checking [29,31]. Other proof-assistant formalizations of 
timed automata focus on proving elementary properties about the basic formalism 
[33,34], or proving properties about concrete automata [26,10,8], but none of 
them are concerned with model checking. 

Earlier work formalizes a model checker for the modal μ-calculus [28], and
constructs a verified finite state LTL model checker [9,24,6]. 


¹ Both tools are available online: https://doi.org/10.5281/zenodo.3679245.

The idea of extracting certificates from the model checking process has
previously been studied in the context of the μ-calculus [23] and finite state
LTL model checking [27]. However, these works are not accompanied by a 
verified certificate checker and do not attempt to scale the approach to practical 
examples. Only the recent work of Griggio et al. [11] provides a practical extraction 
mechanism and a certificate checker for LTL model checking, but the checker is 
not verified. To the best of our knowledge, we are the first to examine certification 
in the context of timed automata model checking. 

Finally, in the context of software verification, the idea of producing certificates 
for the correctness of a program has been broadly studied [16,5]. 


Isabelle/HOL Isabelle/HOL [25] is an interactive theorem prover based on Higher- 
Order Logic (HOL). HOL can be thought of as a combination of a functional 
programming language and mathematical logic. Isabelle/HOL mostly resembles 
standard mathematical notation. Some conventions that are borrowed from 
functional programming need to be explained, however. Functions are mostly 
curried, i.e. of type $\tau_1 \Rightarrow \tau_2 \Rightarrow \tau$ instead of $\tau_1 \times \tau_2 \Rightarrow \tau$. As a consequence,
function application is usually denoted as $f\ a\ b$ instead of $f(a, b)$. Function
abstraction with lambda terms uses the standard syntax $\lambda x.\ t$ (the function that
maps $x$ to $t$) and can also have paired arguments $\lambda(x, y).\ t$. Type variables are
written 'a, 'b, etc. Compound types are written in postfix syntax: $\tau\ \mathit{set}$ is the
type of sets of elements of type $\tau$. We use the Isabelle/HOL convention that
free variables are implicitly all-quantified throughout the paper. In parts of the 
paper, formulas or syntax have been simplified for readability, but we have stayed 
largely faithful to the Isabelle/HOL formalization. 


Contributions In short, these are the main contributions of our work: 


— To the best of our knowledge, we are the first to study certification of the 
model checking results of reachability checking for timed automata, including 
techniques to compress certificates. 

— We construct a verified implementation of such a certificate checker, including 
a number of optimization techniques to make it practically usable. 


Outline The remainder of the paper is organized as follows. The first section 
briefly recalls the theory of timed automata, and sketches the state-of-the-art 
model checking process. The second section details our approach to certification 
and explains how, starting from an abstract theory, a concrete verified imple- 
mentation of the certificate checker can be obtained. Section three illustrates a 
number of techniques to improve the certificate checker’s performance, while only 
mildly increasing the formalization effort. Section four discusses two methods for 
certificate compression. The paper is concluded by an experimental evaluation 
and remarks on potential future work. 


1 Timed Automata and Model Checking


Transition Systems We take a very simple view of transition systems: they are
simply a relation $\to$ of type $'a \Rightarrow 'a \Rightarrow \mathit{bool}$ for a type of states 'a. We write
$a \to^* b$ to denote that $b$ can be reached from $a$ via a sequence of $\to$-transitions.


Timed Automata To make the paper self-contained, this paragraph briefly de- 
scribes timed automata and is mostly reproduced from Wimmer and Lammich 
[29]. For a thorough introduction see the tutorial paper of Bengtsson and Yi [4]. 

[Figure 1: state diagram of a timed automaton with actions a1 to a4; the edge
labelled a4 carries the guard c1 > 100. The remaining labels are not reliably
recoverable from the extracted text.]

Fig. 1: Example of a timed automaton with two clocks.


Compared to standard finite automata, timed automata introduce a notion
of clocks. Figure 1 depicts an example of a timed automaton. We will assume
that clocks are of type nat. A clock valuation $u$ is a function of type $\mathit{nat} \Rightarrow \mathit{real}$.
Locations and transitions are guarded by clock constraints, which have to be
fulfilled to stay in a location or to take a transition. Clock constraints are
conjunctions of constraints of the form $c \sim d$ for a clock $c$, an integer $d$, and
$\sim\, \in \{<, \le, =, \ge, >\}$. We write $u \vdash cc$ if the clock constraint $cc$ holds for the
clock valuation $u$. We define a timed automaton $A$ as a pair $(\mathcal{T}, \mathcal{I})$ where $\mathcal{I}$ is
a mapping from locations to clock constraints (also named invariants); and $\mathcal{T}$
is a set of transitions written as $A \vdash l \xrightarrow{g,a,r} l'$ where $l$ and $l'$ are start and
successor location, $g$ is the guard of the transition, $a$ is an action label, and $r$ is
a list of clocks that will be reset to zero when the transition is taken. States of
timed automata are pairs of a location and a clock valuation. The operational
semantics defines two kinds of steps (given as their HOL descriptions):


— Delay: $(l, u) \to^d (l, u \oplus d)$ if $d \ge 0$ and $u \oplus d \vdash \mathcal{I}\ l$;
— Action: $(l, u) \to_a (l', [r := 0]u)$
if $A \vdash l \xrightarrow{g,a,r} l'$, $u \vdash g$, and $[r := 0]u \vdash \mathcal{I}\ l'$;


where $u \oplus d = (\lambda c.\ u\ c + d)$ offsets all clocks by $d$ in the valuation $u$, and
$[r := 0]u = (\lambda c.\ \text{if } c \in r \text{ then } 0 \text{ else } u\ c)$ resets all clocks in $r$ to 0 in valuation $u$.
For any (timed) automaton $A$, we consider the transition system


$$(l, u) \to_A (l', u') = (\exists d \ge 0.\ \exists a\ u''.\ (l, u) \to^d (l, u'') \wedge (l, u'') \to_a (l', u')).$$

That is, each transition consists of a delay step that advances all clocks by some
amount of time, followed by an action step that takes a transition and resets the
clocks annotated to the transition. Given a final state predicate $F$ and an initial
state $(l_0, u_0)$, we are interested in whether $(l_0, u_0) \to_A^* (l, u)$ for any $l$, $u$ with $F\ l$.
In Figure 1, the final state is $l_3$ (i.e. $F\ l \longleftrightarrow l = l_3$). As the guard for action $a_4$
is never enabled, $l_3$ is unreachable.
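
The delay and action steps can be made concrete with a small executable sketch. The Python code below is an illustration under assumptions, not the Isabelle/HOL formalization: clock valuations are dictionaries, clock constraints are lists of `(clock, operator, bound)` triples, and the example automaton is a hypothetical one that only loosely resembles Figure 1.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
       ">=": operator.ge, ">": operator.gt}


def holds(u, cc):
    """u |- cc: the valuation u satisfies every conjunct of cc."""
    return all(OPS[op](u[c], d) for (c, op, d) in cc)


def delay(inv, l, u, d):
    """Delay step: advance all clocks by d >= 0 if the invariant of l
    still holds afterwards; return None if the step is not possible."""
    v = {c: x + d for c, x in u.items()}
    if d >= 0 and holds(v, inv[l]):
        return (l, v)
    return None


def action(inv, trans, l, u, a):
    """Action step: all states reachable from (l, u) via action a."""
    result = []
    for (src, guard, act, resets, dst) in trans:
        if src == l and act == a and holds(u, guard):
            v = {c: (0 if c in resets else x) for c, x in u.items()}
            if holds(v, inv[dst]):  # invariant of the target location
                result.append((dst, v))
    return result


# Hypothetical automaton: invariants per location and a transition list of
# (source, guard, action, clocks to reset, target).
INV = {"l1": [], "l2": [("c1", "<=", 1)], "l3": []}
TRANS = [
    ("l1", [("c1", ">", 0)], "a1", ["c2"], "l2"),
    ("l2", [("c1", ">", 100)], "a4", [], "l3"),
]

start = ("l1", {"c1": 0.0, "c2": 0.0})
after_delay = delay(INV, start[0], start[1], 0.5)
after_a1 = action(INV, TRANS, after_delay[0], after_delay[1], "a1")
blocked_delay = delay(INV, "l2", after_a1[0][1], 1.0)
blocked_a4 = action(INV, TRANS, "l2", after_a1[0][1], "a4")
```

In this toy automaton the invariant on `l2` bounds `c1` by 1, so the guard `c1 > 100` of `a4` can never hold and `l3` stays unreachable.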


Model Checking Due to the use of clock valuations, the state space of timed 
automata is inherently infinite. Thus, model checking algorithms for timed 
automata are based on the idea of abstracting from concrete valuations to sets 
of clock valuations of type $(\mathit{nat} \Rightarrow \mathit{real})\ \mathit{set}$, often called zones. The resulting
transition system of reachable states from an initial zone is called the zone graph. 
It is explored in an on-the-fly manner, computing successors on zones, which 
are typically represented symbolically as Difference Bound Matrices (DBMs). 
Knowledge of this data structure is not necessary to understand the rest of the 
paper. Thus we refer the interested reader to Bengtsson and Yi [4] and to Wimmer 
and Lammich [29,31] for a verification of this data structure. In the remainder 
we will only use the term “zones” instead of referring to their implementation as 
DBMs. 

The delicate part of this method is that the number of reachable zones could
still be infinite. Therefore, over-approximations (or abstractions) of zones are
computed to obtain a finite search space. For our purpose, it is sufficient to assume
that an abstraction operator $\alpha$ indeed computes an over-approximation, i.e. $Z \subseteq \alpha(Z)$
for any zone $Z$. We call the version of the zone graph where abstractions are
applied the abstract zone graph [13]. For a number of such abstraction operators,
it can be shown that the abstract zone graph is sound and complete². The
proofs are rather intricate, however. Thus formalizing them would be a big effort.
By focusing on certification of unreachability, this problem vanishes, as we only
need to ensure that any state $(l, Z)$ that we deem reachable in the zone graph is
subsumed by some state $(l, Z')$ with $Z \subseteq Z'$ that is part of the certificate and
that was computed by the abstraction (i.e. $Z' = \alpha(Z_1)$ for some $Z_1$).


Certificates by Example Figure 2 depicts the zone graph of the automaton in 
Figure 1. Each zone $Z$ is given as a clock constraint $cc$ such that $Z = \{u \mid u \vdash cc\}$.
A model checker like Munta would have to explore the full zone graph before 
being able to decide that l3 is unreachable. Any model checker that uses the same 
abstraction technique as Munta [2] would not be able to benefit from abstractions 
for this example and thus the abstract zone graph is the same as the zone graph. 
However, such a model checker could apply subsumptions while exploring the zone 
graph. That is, when a symbolic state of the form $(l_2, \{u \mid u \vdash c_1 = 0 \wedge c_2 < k+1\})$
is explored, the state $(l_2, \{u \mid u \vdash c_1 = 0 \wedge c_2 < k\})$ can safely be discarded.
This means that at the end of the model checking process, only the three
states in Figure 3a will be stored. The solid edges are part of the zone graph,
while the dashed edge indicates that the zone at its tail has a successor in the
zone graph ($(l_2, \{u \mid u \vdash c_1 = 0 \wedge c_2 < 1\})$) that is subsumed by the tip of
the edge. The set of these three states can act as a certificate of unreachability.
They essentially form an inductive invariant of the zone graph: for each state in
the certificate, all its successors in the zone graph are either contained in the
certificate themselves or subsumed by another state in the certificate. Thus we
know that any symbolic state that is reachable from the initial state is subsumed
by some state in the certificate, and as the final state is not contained in the
certificate, we can conclude that it is unreachable.


[Figure 2: diagram of the zone graph, with one symbolic state for l1 and a
sequence of symbolic states for l2; the exact node labels are not reliably
recoverable from the extracted text.]

Fig. 2: The zone graph of the automaton depicted in Figure 1.


² Soundness: for every abstract run, there is a concrete instantiation. Completeness:
every concrete run can be abstracted.

Figure 3b shows a certificate with only two states that replaces the two states 
for $l_2$ by the state with a dashed border. Note that this state is not part of the
original zone graph. The certificate fulfills the same invariant property and thus 
also proves unreachability. We will use this technique of adding larger states to 
the certificate that are not part of the zone graph for our compression techniques 
in section 4. 
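
The effect of subsumption on the number of stored states can be illustrated with a simplified Python sketch. It rests on stated assumptions: zones are modelled as one closed interval per clock rather than as DBMs, and the helper names `zone_subsumed` and `prune_subsumed` are hypothetical. It is not one of the compression techniques of section 4.

```python
def zone_subsumed(z1, z2):
    """True if zone z1 is contained in zone z2, where a zone maps each
    clock to a closed interval (lo, hi)."""
    return all(
        z2[c][0] <= lo and hi <= z2[c][1] for c, (lo, hi) in z1.items()
    )


def prune_subsumed(states):
    """Drop every symbolic state (l, Z) that is subsumed by another state
    with the same location; of several equal states keep the first."""
    kept = []
    for i, (l, z) in enumerate(states):
        dominated = any(
            j != i
            and l_other == l
            and zone_subsumed(z, z_other)
            and (not zone_subsumed(z_other, z) or j < i)
            for j, (l_other, z_other) in enumerate(states)
        )
        if not dominated:
            kept.append((l, z))
    return kept


# Hypothetical exploration result: one state for l1 and a growing chain of
# zones for l2, similar in spirit to the zone graph of Figure 2.
explored = [("l1", {"c1": (0, 0), "c2": (0, 0)})] + [
    ("l2", {"c1": (0, 0), "c2": (0, k)}) for k in range(1, 101)
]
stored = prune_subsumed(explored)
```

Because subsumption is transitive, every successor that was covered by a dropped state is still covered by the state that subsumes it, so the pruned set remains a valid certificate.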


[Figure 3: two diagrams, (a) Stored states and (b) Smaller certificate; the node
labels are not reliably recoverable from the extracted text.]


Fig. 3: Two certificates of unreachability for the automaton from Figure 1. 


Verified Certification of Reachability Checking for Timed Automata 431 


2 From Model Checking to Certifying Unreachability 


This section first describes our approach to certification abstractly. Then, we 
detail how the existing formalization of a timed automata model checker was 
extended—with rather low effort—to a verified certifier. In practice, networks 
of timed automata with additional modeling features such as, e.g. shared state 
variables, are used. However, due to the existing verified product construction for 
such a formalism [31], it is sufficient to study the case of a single timed automaton 
here. 


2.1 An Abstract Correctness Theorem 


To work towards a rigorous justification of the certification process, we first study 
the problem on a more abstract level. Consider a transition system $\to$ on states
of type $'l \times 's$ where 'l corresponds to the finite state part of timed automata
and 's corresponds to zones. We assume an invariant $P$ on states, i.e.:


$$P\ (l_1, s_1) \wedge (l_1, s_1) \to (l_2, s_2) \Longrightarrow P\ (l_2, s_2).$$


This invariant essentially represents a restriction of $\to$ to valid states. While this
would usually be assumed implicitly, we explicate P here as it is technically more 
convenient to do so in the Isabelle/HOL formalization. 

The interesting feature that sets timed automata model checking apart is
subsumption. Recall that during the model checking process, it is possible to
first discover some (symbolic) state $(l, Z)$ (a pair of a discrete state $l$ and a zone
$Z$), and to find at some later point that another reachable state $(l, Z')$ subsumes
$(l, Z)$ because $Z'$ semantically contains $Z$, i.e. $Z \subseteq Z'$. At this point the state
$(l, Z)$ can be discarded as we know that anything that is reachable from $(l, Z)$ is
also reachable from $(l, Z')$. Abstractly, subsumption is modeled by some fixed
preorder (i.e. a reflexive and transitive relation) $\preceq$ on 's which is a simulation
relation between $\to$ and itself:


$$s_1 \preceq s_2 \wedge (l_1, s_1) \to (l_2, t_1) \wedge P\ (l_1, s_1) \wedge P\ (l_1, s_2)
\Longrightarrow \exists t_2.\ t_1 \preceq t_2 \wedge (l_1, s_2) \to (l_2, t_2)$$


In the abstract setting, a certificate consists of a set of discrete states $L$ of
type $'l\ \mathit{set}$, and a mapping $M$ of type $'l \Rightarrow 's\ \mathit{set}$ that gives the set of reachable
symbolic states that were computed for any discrete state $l \in L$. We say that
$(L, M)$ satisfies $P$ if all states in the certificate $(L, M)$ satisfy $P$:


$$l \in L \wedge s \in M\ l \Longrightarrow P\ (l, s)$$


Moreover, the certificate needs to be closed. Following Herbreteau et al. [13], 
we call a state covered if it is subsumed by another state in the certificate. A 
certificate is closed if for each state in the certificate all its successors are covered: 


$$l_1 \in L \wedge s_1 \in M\ l_1 \wedge (l_1, s_1) \to (l_2, s_2) \Longrightarrow l_2 \in L \wedge (\exists s_3 \in M\ l_2.\ s_2 \preceq s_3) \qquad (*)$$


The following key theorem states that all reachable states are covered if the 
initial state is covered: 

Theorem 1. Let $(L, M)$ be closed and invariant under $P$. Assume $l_0 \in L$,
$s_0' \in M\ l_0$, $s_0 \preceq s_0'$, and $(l_0, s_0) \to^* (l, s)$. Then $l \in L$ and there exists $s'$ such
that $s' \in M\ l$ and $s \preceq s'$.


Proof. By induction on the number of steps in $(l_0, s_0) \to^* (l, s)$. The following
sketches how the run of covering states is constructed. The first line represents
$(l_0, s_0) \to^* (l, s)$ and the states in the third line are all part of the certificate.


$$\begin{array}{ccccccc}
(l_0, s_0) & \to & (l_1, s_1) & \to & \cdots & \to & (l, s) \\
 & & \preceq & & & & \preceq \\
\preceq & & (l_1, t_1) & & \cdots & & (l, t) \\
 & \nearrow & \preceq & & & \nearrow & \preceq \\
(l_0, s_0') & & (l_1, s_1') & & \cdots & & (l, s')
\end{array}$$


From the assumptions on $l_0$, $s_0$, and $s_0'$, we can first apply the self-simulation
property of $\to$ to $(l_0, s_0) \to (l_1, s_1)$ to obtain a $t_1$ such that $s_1 \preceq t_1$ and $(l_0, s_0') \to
(l_1, t_1)$. As the certificate is closed we thus get $l_1 \in L$ and we can find an $s_1' \in M\ l_1$
such that $t_1 \preceq s_1'$ (and thus $s_1 \preceq s_1'$ by transitivity). The induction hypothesis
can then be applied to $l_1$, $s_1$, and $s_1'$.


We will now say that a certificate (L, M) is admissible iff 


— it satisfies P, 

— it is closed, 

— it covers the initial state (i.e. there is an $s_0' \in M\ l_0$ such that $s_0 \preceq s_0'$),
— and there is no $l \in L$ with $F\ l$.


Corollary 1. If $F$ is monotone w.r.t. $\preceq$ and the certificate $(L, M)$ is admissible,
then $\nexists l\ s.\ (l_0, s_0) \to^* (l, s) \wedge F\ l$.
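
The four admissibility conditions can be written down directly for finite certificates. The following Python sketch is an illustration under assumptions, not the verified checker: symbolic states are plain integers ordered by `<=`, and the names `is_admissible`, `succs` and the example certificates are hypothetical.

```python
def is_admissible(L, M, succs, leq, P, F, l0, s0):
    """Check the four admissibility conditions for the certificate (L, M).

    L: collection of discrete states
    M: dict mapping each l in L to a list of symbolic states
    succs(l, s): iterable of successor pairs (l2, s2)
    leq(s, t): the subsumption preorder
    P(l, s): the state invariant; F(l): the final state predicate
    (l0, s0): the initial state
    """
    satisfies_p = all(P(l, s) for l in L for s in M[l])
    closed = all(
        l2 in L and any(leq(s2, s3) for s3 in M[l2])
        for l1 in L
        for s1 in M[l1]
        for (l2, s2) in succs(l1, s1)
    )
    covers_initial = l0 in L and any(leq(s0, s) for s in M[l0])
    no_final = not any(F(l) for l in L)
    return satisfies_p and closed and covers_initial and no_final


# Hypothetical system: a symbolic state is an upper bound k on a clock.
def succs(l, k):
    if l == "l1":
        return [("l2", 1)]
    return [("l2", min(k + 1, 100))]  # the bound grows until it saturates


L = ["l1", "l2"]
good = {"l1": [0], "l2": [100]}   # the saturated bound covers all successors
bad = {"l1": [0], "l2": [50]}     # the successor bound 51 is not covered


def leq(a, b):
    return a <= b


def P(l, s):
    return 0 <= s <= 100


def F(l):
    return l == "l3"


good_ok = is_admissible(L, good, succs, leq, P, F, "l1", 0)
bad_ok = is_admissible(L, bad, succs, leq, P, F, "l1", 0)
```

The certificate `good` stores only two symbolic states even though a full exploration would visit about a hundred, which mirrors the smaller certificate discussed for Figure 3b.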


2.2 An Abstract Certificate Checker 


In practice, the certification process has to consider one additional complication. 
A model is typically described in terms of human-readable identifiers, while most 
model checkers and the verified model checker Munta [30] in particular represent 
these as natural numbers internally to allow for efficient indexing. In our certifier, 
this is accounted for by relabeling the human-readable identifiers in a given model 
to natural numbers in a first (verified) pre-processing step. To save additional 
transformations of the certificate after it was emitted, we let the unverified model 
checker additionally emit a textual description of such a renaming. The certifier 
then just needs to check that the given renaming is injective to ensure that it 
can safely be applied. 
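
The injectivity check on the renaming is simple enough to sketch directly. The following Python fragment is only an illustration; the function names and the dictionary representation of the renaming are assumptions, not the interface of the verified certifier.

```python
from typing import Dict, List


def is_injective(renaming: Dict[str, int]) -> bool:
    """A finite map is injective iff no two keys share a value."""
    return len(set(renaming.values())) == len(renaming)


def apply_renaming(renaming: Dict[str, int], names: List[str]) -> List[int]:
    """Rename identifiers to numbers, refusing a non-injective renaming."""
    if not is_injective(renaming):
        raise ValueError("renaming is not injective")
    return [renaming[name] for name in names]
```

A renaming such as `{"idle": 0, "busy": 0}` is rejected, because two distinct identifiers of the model would collapse into the same number.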

Together with the theoretical analysis laid out in the last section, we can thus 
derive the following strategy for certifying unreachability: 


— An unverified model checker explores the reachable state space of a given
model symbolically and checks that none of the discovered states $(l, s)$ fulfills
$F\,l$.

— The set of explored states is emitted as a certificate, possibly followed by
compression (see section 4).

— The model, the final state predicate $F$, the certificate, and a description of
the renaming that was used for the states are passed to the verified certifier.

— The certifier checks that the given renaming is injective, renames the model
accordingly, applies the product construction and checks that the certificate
is admissible.


 1  definition check (L, M) =
 2    monadic_list_all L (λl. do {
 3      let S = M l;
 4      let next = succs l S;
 5      monadic_list_all next (λ(l', S'). do {
 6        xs ← SPEC (λxs. set xs = S');
 7        if xs = [] then return True else do {
 8          b1 ← return (l' ∈ L);
 9          ys ← SPEC (λxs. set xs = M l');
10          b2 ← monadic_list_all xs (λx.
11            monadic_list_ex ys (λy. return (x ⪯ y))
12          );
13          return (b1 ∧ b2)
14        }
15      })
16    })


Listing 1.1: Monadic program to check whether a certificate is closed.


If the process is successful, we can conclude by Corollary 1 that no “bad” state
$(l, s)$ (i.e. with $F\,l$) is reachable symbolically. We will argue that this really implies
that the model is safe in the concrete case of timed automata in section 2.3.

We now lay out how a verified certificate checker that implements said 
strategy for an abstract transition system can be constructed in Isabelle/HOL. 
Listing 1.1 displays the definition of the core of the checker that checks whether 
the certificate is closed in the sense defined above. The program is defined in the 
non-determinism monad of the Imperative Refinement Framework (IRF) [20]. 
Some parts, such as checking set membership or converting a (finite) set to a 
list are still left abstract. A non-deterministic specification SPEC $Q$ returns some
value $v$ with $Q\,v$.

The body of the program (lines 2-16) iterates over all discrete states in the
certificate $L$ and checks that all corresponding symbolic states are covered. Line
3 retrieves the symbolic states that correspond to discrete state $l$ and in line 4
their symbolic successor states are computed. The result (next) is a list of pairs
of a discrete state and the set of its corresponding symbolic states. The loop
ranging from lines 5 to 15 iterates over this list to ensure that all the successor
states are covered. Given a discrete state $l'$ and a set of symbolic states $S'$, line 6
first converts it into a list xs that can be iterated over. This turns into a vacuous
operation when the algorithm is refined to an executable version where sets are
implemented as lists. Line 8 checks that $l'$ is also part of the certificate. Then, in
line 9 the set of corresponding symbolic states is retrieved and converted to a list
ys. Finally, lines 10-12 ensure that all states in xs are subsumed by some state
in ys.

To prove soundness of check, we mainly need correctness theorems for the
monadic combinators monadic_list_all and monadic_list_ex. Given a list xs and
a monadic implementation $Q_i$ of a predicate $Q$, they check whether all states
(at least one state) in xs satisfy (satisfies) $Q$. This is the correctness theorem for
monadic_list_all, for instance:


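$$(\forall x.\ Q_i\ x \le \mathrm{SPEC}\ (\lambda r.\ r \longleftrightarrow Q\ x))
\Longrightarrow \mathit{monadic\_list\_all}\ \mathit{xs}\ Q_i \le \mathrm{SPEC}\ (\lambda r.\ r \longleftrightarrow \mathit{list\_all}\ \mathit{xs}\ Q)$$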


where list_all xs Q holds if and only if Q holds for all elements in xs. After setting
up the IRF’s verification condition generator with this rule and the corresponding
rule for monadic_list_ex, it is easy to prove that check is sound:


$$\mathit{check}\ (L, M) \le \mathrm{SPEC}\ (\lambda r.\ r \longrightarrow \mathit{closed}\ (L, M))$$


where the property closed (L, M) corresponds to condition (*) from above. 

We then use standard refinement techniques to obtain an algorithm $\mathit{check}_i$ that
refines check, replacing sets by lists. However, the algorithm is still specified in the
non-determinism monad and therefore not executable. We use a simple technique
to make it executable. Consider the following theorem for monadic_list_all:


$$\mathit{monadic\_list\_all}\ \mathit{xs}\ (\lambda x.\ \mathrm{return}\ (P\ x)) = \mathrm{return}\ (\mathit{list\_all}\ \mathit{xs}\ P).$$


It allows us to replace the non-deterministic combinator monadic_list_all by the 
deterministic list_all, pushing return to the outside. By exhaustively applying a 
set of such rewrite rules we obtain an alternative definition of $\mathit{check}_i$ where return
appears only on the outermost level, and the inner term is deterministic and thus 
executable. Using these techniques, we obtain a simple certificate checker that 
is executable, provided that we can implement the elementary model checking 
primitives such as the subsumption check or computing the list of successors of a 
state. 


2.3 Transferring the Correctness Theorem 


For timed automata, the abstract transition system studied above is the zone
graph $\rightsquigarrow_{ZG(A)}$ of a given (single) automaton $A$. One can show that it simulates
$\to_A$ (completeness of $\rightsquigarrow_{ZG(A)}$):


$$(l, u) \to_A (l', u') \land u \in Z \Longrightarrow (\exists Z'.\ (l, Z) \rightsquigarrow_{ZG(A)} (l', Z') \land u' \in Z').$$


This simulation property is sufficient to establish that if there is no reachable
state $(l, Z)$ in $\rightsquigarrow_{ZG(A)}$ with $F\,l$, then no final state $(l, u)$ is reachable in $\to_A$:


$$(\nexists\, l, Z.\ (l_0, Z_0) \rightsquigarrow^*_{ZG(A)} (l, Z) \land F\,l) \land u_0 \in Z_0
\Longrightarrow (\nexists\, l, u.\ (l_0, u_0) \to^*_A (l, u) \land F\,l)$$

In the formalization, these proofs rely on instantiating a general theory of
simulations in transition systems that is derived from the theory of Wimmer and
Lammich [31]. From Corollary 1 we get that there is no reachable final state in
$\rightsquigarrow_{ZG(A)}$ if the certificate check is passed. Finally, by correctness of the renaming
process and the product construction, we can conclude that there is no final
reachable state in the input model if there is no final reachable state in $\to_A$.


2.4 Implementing a Concrete Checker 


All the elementary model checking primitives we need for certification have 
already been implemented [31]. The abstract implementation presented above 
assumes that the model checking primitives are implemented in a purely functional 
manner (as they are just regular HOL functions). The existing (verified) model 
checker [31], however, is an imperative implementation in the Imperative HOL 
framework. Imperative HOL [7] is a framework for specifying and reasoning about 
imperative programs in Isabelle/HOL. It provides a heap monad in which one 
can use—analogously to the ML family of programming languages—imperative 
references and arrays to express imperative programs. Usually, once we have used 
an imperative implementation anywhere, the whole program would need to be 
stated in the heap monad. However, we can employ a technique similar to the 
one that is used for Haskell’s ST monad [21] to erase the heap monad in a safe 
way under certain circumstances. 

More precisely, if it can be deduced from the type of an imperative computation 
that no information about references or arrays on the heap can be leaked to the 
outside of the computation in its result, then the heap monad can be erased for 
this computation, yielding a pure computation. In the certifier, this is primarily 
used for computing the symbolic successor of a zone Z for a certain transition. 
To that end, an immutable representation of the DBM M corresponding to Z is 
copied to a newly allocated imperative array, then the imperative pipeline
of computations to compute the successor M’ is applied to M, and finally M’ 
is copied back to an immutable array. Taken together, this whole computation 
does not contain the type of an array or reference in its result type, and thus 
can safely be turned into a pure computation. As a consequence, we are able to 
reuse the existing verified model checking primitives, while being able to state 
the certificate checking algorithm purely functionally. 
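
The copy-in, mutate, copy-out pattern described above can be sketched in Python as follows. This is only an illustration of the idea: the names `run_pure`, `up`, and `reset_to_zero` are hypothetical, DBM entries are simplified to plain numbers without strictness information, and Python offers none of the type-level guarantees that justify erasing the heap monad in Isabelle/HOL.

```python
import math
from typing import Callable, List, Sequence, Tuple

DBM = Tuple[Tuple[float, ...], ...]  # immutable representation
MutableDBM = List[List[float]]       # scratch representation used internally


def up(d: MutableDBM) -> None:
    """Let time elapse: remove the upper bounds on all clocks (in place)."""
    for i in range(1, len(d)):
        d[i][0] = math.inf


def reset_to_zero(x: int) -> Callable[[MutableDBM], None]:
    """Return an in-place step that resets clock x to 0."""
    def step(d: MutableDBM) -> None:
        for j in range(len(d)):
            d[x][j] = d[0][j]
            d[j][x] = d[j][0]
    return step


def run_pure(dbm: DBM, pipeline: Sequence[Callable[[MutableDBM], None]]) -> DBM:
    """Copy in, run the imperative pipeline, copy out.

    The scratch array never escapes this function, so callers observe
    a pure function from immutable DBMs to immutable DBMs.
    """
    scratch = [list(row) for row in dbm]
    for step in pipeline:
        step(scratch)
    return tuple(tuple(row) for row in scratch)
```

For example, `run_pure(zero, [up, reset_to_zero(1)])` leaves the input `zero` untouched and returns a fresh immutable matrix.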

In the concrete checker, the mapping M is implemented using a verified 
functional hash table implementation based on so-called diff arrays [19]. This 
data structure provides a purely functional interface to an underlying imperative 
array. When a diff array is updated, it performs the update on the imperative 
array, and stores a difference that can be used to re-compute the old state of the 
array. Reading from the most recent version of a diff array is fast as the value 
can directly be read from the underlying imperative array. If an old version is 
accessed, the whole array has to be copied to recompute the old version. This 
gives diff arrays good performance characteristics, as long as they are mostly 
used linearly. This is the case in our application as the hash table is filled in an 
initial phase, after which the hash table is used in a read-only manner. 
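
The behaviour of diff arrays can be illustrated with a small Python sketch. The class below is a hypothetical illustration written for this explanation, not the verified implementation from [19]: the newest version owns the underlying list, older versions store a single difference, and accessing an old version copies the whole array.

```python
class DiffArray:
    """One immutable version of an array with a functional interface."""

    def __init__(self, items=None):
        # Only the most recent version owns a list; stale versions keep a diff.
        self._data = list(items) if items is not None else None
        self._diff = None  # (index, old_value, newer_version)

    def get(self, i):
        self._realize()
        return self._data[i]

    def set(self, i, value):
        """Return a new version; the update happens in place on the shared list."""
        self._realize()
        newer = DiffArray()
        newer._data = self._data
        old_value = self._data[i]
        newer._data[i] = value
        # This version becomes stale and only remembers how to undo the update.
        self._data = None
        self._diff = (i, old_value, newer)
        return newer

    def _realize(self):
        """Make sure this version owns a list, copying if it is stale."""
        if self._data is not None:
            return
        chain = []
        node = self
        while node._data is None:
            chain.append(node._diff)
            node = node._diff[2]
        data = list(node._data)  # copy the whole array
        for i, old_value, _ in reversed(chain):
            data[i] = old_value  # undo the newer updates, most recent first
        self._data = data
        self._diff = None
```

Reads and writes on the newest version touch the shared list directly, which matches the linear usage pattern of the hash table: it is filled once and then only read.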


2.5 Parallel Execution


The attentive reader may wonder why we care about a purely functional implementation
of the certificate checker at all. Indeed, we could use existing techniques
[31] to obtain an imperative implementation of the certificate checker in the 
heap monad. However, in this setting it would be hard to justify the soundness 
of executing parts of the checker in parallel. In the purely functional setting, 
this is much simpler. Our approach to parallel execution is minimalist: we only 
provide means to execute the map combinator on lists in parallel. This is achieved 
by another custom code translation that is part of the trusted code base. The 
parallel implementation of map uses a task queue that will contain the individual 
computations that need to be run for each element of xs, and uses a fixed number
of threads to work through this list and assemble the final result. 

We exploit this map implementation to work through the list of discrete states 
L in parallel, using the equivalence: 


$$\mathit{list\_all}\ Q\ \mathit{xs} = \mathit{list\_all}\ \mathit{id}\ (\mathit{map}\ Q\ \mathit{xs}).$$


In doing so, we lose the ability to stop execution early once a list element does
not satisfy Q. For the certificate checker, however, we assume that usually
the certificate is correct, meaning that we have to go through the whole list
anyway. We only parallelize the outermost loop of $\mathit{check}_i$ because this should
yield reasonably-sized work portions, given that the size of L will typically at
least be in the hundreds.
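
The structure of this parallelization can be sketched in Python with a thread pool from the standard library. This is an illustration under stated assumptions: `parallel_map` and `parallel_list_all` are hypothetical names, the pool stands in for the task queue with a fixed number of threads, and Python threads will not give a real speedup on CPU-bound work, so only the shape of the computation carries over.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def parallel_map(f: Callable[[A], B], xs: Sequence[A], num_threads: int = 4) -> List[B]:
    """Apply f to every element using a fixed number of worker threads.

    The results are returned in the order of the input list.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(f, xs))


def parallel_list_all(q: Callable[[A], bool], xs: Sequence[A], num_threads: int = 4) -> bool:
    """list_all q xs, computed as list_all id (map q xs) with a parallel map."""
    return all(parallel_map(q, xs, num_threads))
```

As in the text, every element is evaluated even if an early one already fails the predicate, because the whole list of results is built before `all` inspects it.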


3 Scaling Performance 


In this section we discuss two techniques to improve the performance of the 
certificate checker without increasing the verification effort significantly. 


3.1 Monomorphization 


Isabelle/HOL supports polymorphism and type classes, which are valuable features
for sizeable formalization efforts. Large parts of our formalization also make
use of these features, e.g., most of the timed automata semantics are formalized
for a general time domain, and operations on DBMs are applicable to DBMs
whose entries are formed from more general algebraic structures than the ring of
integers. While this yields an abstract and general formalized theory, it can get
in our way when trying to obtain efficient code.

When generating SML code from HOL, Isabelle uses a so-called dictionary 
construction to compile out type classes, which are not supported by SML. This 
means that most functions carry a large number of additional parameters, which 
are used to look up elementary operations, such as addition of two numbers. 
These additional lookup operations degrade performance. One solution is to 
ensure that all relevant constants that are exported to SML are monomorphic 
(i.e. specialized to the integer type), eliminating the need for the dictionary 
construction in most places. Thus, we apply a semi-automated procedure to 
achieve this monomorphization. 
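
The difference between dictionary-passing code and monomorphic code can be sketched as follows. The Python fragment only mimics the shape of the two styles; `AddDict`, `sum_poly`, and `sum_int` are hypothetical names and do not correspond to any generated code.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class AddDict(Generic[T]):
    """Explicit dictionary standing in for a type class instance."""
    zero: T
    plus: Callable[[T, T], T]


def sum_poly(d: AddDict, xs: Iterable) -> object:
    """Polymorphic version: every addition is looked up in the dictionary."""
    acc = d.zero
    for x in xs:
        acc = d.plus(acc, x)
    return acc


INT_ADD = AddDict(zero=0, plus=lambda a, b: a + b)


def sum_int(xs: Iterable[int]) -> int:
    """Monomorphic version: specialized to integers, no dictionary parameter."""
    acc = 0
    for x in xs:
        acc = acc + x
    return acc
```

Both functions compute the same result on integers, but `sum_poly` carries an extra parameter and performs an indirect call for each elementary operation, which is the overhead monomorphization removes.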


3.2 Integer Representation


Types such as int or nat are unbounded in Isabelle/HOL, meaning they are
implemented with the help of big integers in the target languages. To improve performance,
we want to use machine integers instead, and instruct Isabelle/HOL’s
code generator to do that. This is still sound: SML’s standard integer operations
throw an exception if an overflow occurs instead of silently wrapping around. The 
code generator can only achieve partial correctness anyway: if program execution 
does not fail, then its result is consistent with the evaluated HOL term. 
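
The soundness argument can be illustrated with a Python sketch of overflow-checked arithmetic. The 64-bit width and the name `checked_add` are assumptions made for this example; the width of machine integers in the actual SML target may differ.

```python
INT_BITS = 64  # assumption: width of the target's machine integers
INT_MAX = 2 ** (INT_BITS - 1) - 1
INT_MIN = -(2 ** (INT_BITS - 1))


def checked_add(a: int, b: int) -> int:
    """Add two machine integers, failing loudly instead of wrapping around."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise OverflowError("machine integer overflow")
    return result
```

Whenever `checked_add` returns, its result agrees with unbounded integer addition. An overflow aborts the run rather than producing a wrong value, which is the partial-correctness guarantee described above.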


3.3 Refined Code Equations 


The last type of optimization we use can be considered to belong to the category
of micro-optimizations. These are improved code generator translations for
elementary operations and combinators. We employ such improved translations
to use native implementation language primitives to convert from mutable to
immutable arrays and back. The other such optimization is to directly
use integer values as counters in imperative loops instead of a natural number
representation that would box the integers in a data constructor. In the same
way, we use integers directly for array indexing.


4 Certificate Compression 


In this section, we present two techniques to compress the unreachability certificate.
By compression we mean reducing the number of zones that are present in
the certificate for each discrete state, using the unverified model checker. The
first technique relies on subsumption. As explained above, it is possible that the
model checker adds a zone $Z$ to the set of explored states and later another zone
$Z'$ with $Z \subseteq Z'$ (i.e. $Z'$ subsumes $Z$). Thus the first technique simply filters the
set w.r.t. $\subseteq$ in the end.
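
A minimal sketch of such a subsumption filter is given below. It assumes that zones are represented as canonical DBMs with plain numeric bounds (strictness is ignored), so that inclusion can be decided by a pointwise comparison; the function names are hypothetical and this is not the code used in Mlunta.

```python
from typing import List, Tuple

DBM = Tuple[Tuple[float, ...], ...]


def subsumed(d1: DBM, d2: DBM) -> bool:
    """Zone d1 is included in zone d2 (d1 canonical): pointwise comparison."""
    return all(a <= b for r1, r2 in zip(d1, d2) for a, b in zip(r1, r2))


def filter_subsumed(zones: List[DBM]) -> List[DBM]:
    """Keep only zones that are maximal w.r.t. inclusion, dropping duplicates."""
    kept: List[DBM] = []
    for z in zones:
        if any(subsumed(z, k) for k in kept):
            continue  # z adds nothing new
        kept = [k for k in kept if not subsumed(k, z)]
        kept.append(z)
    return kept
```

With a single clock x, the zones for x in [0, 2], x in [0, 5], and x in [3, 4] collapse to the single zone for x in [0, 5].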

The second technique relies on the following idea: we replace one or more 
zones by their union, and check that the state space is still closed. This means 
that we have to check that all the successors of the larger zone are still covered 
by the current set of states. In that case, we can discard the old zones, and 
replace them by their union. As the union of two zones is not necessarily convex 
and thus cannot be represented as a DBM, we do not compute a precise union 
of zones but their convex hull. This operation is rather cheap as it amounts to 
taking the pointwise maximum of DBM entries. After computing the convex hull 
of a number of zones (in canonical form), we only need to apply the expensive 
operation to restore a canonical form once. 
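
The convex hull computation can be sketched as follows, again under simplifying assumptions: DBM entries are plain numbers without strictness information, the canonical form is restored with a textbook Floyd-Warshall pass, and the function names are hypothetical.

```python
from typing import List, Sequence

DBM = List[List[float]]


def convex_hull(dbms: Sequence[Sequence[Sequence[float]]]) -> DBM:
    """Pointwise maximum of the entries of several DBMs of equal dimension."""
    n = len(dbms[0])
    return [[max(d[i][j] for d in dbms) for j in range(n)] for i in range(n)]


def canonicalize(d: Sequence[Sequence[float]]) -> DBM:
    """Tighten all bounds via all-pairs shortest paths (Floyd-Warshall)."""
    n = len(d)
    m = [list(row) for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][k] + m[k][j] < m[i][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


def hull_of_zones(dbms: Sequence[Sequence[Sequence[float]]]) -> DBM:
    """Join any number of zones, restoring the canonical form only once."""
    return canonicalize(convex_hull(dbms))
```

For one clock x, joining the zones for x in [0, 2] and x in [3, 4] yields the zone for x in [0, 4], which also contains valuations such as x = 2.5 that belong to neither input zone. This is why the compressed certificate has to be re-checked for closedness.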

The latter technique yields a whole family of compression algorithms by 
iterating one of the following operations for each discrete state until a fixed-point 
is reached: 


a) the convex hull of all zones is computed;
b) the convex hull of the first two zones is computed;
c) the convex hull of the first two zones that can successfully be joined is put to
the front of the list;
d) same as c) but considering only discrete states for which compression was
successful in the last round;
e) same as d) but iterating the operation until saturation.


The next section contains an experimental evaluation of these techniques. 

Note that similar techniques for reducing the search-space could also be applied 
already during model checking. By doing so, the number of states explored and 
the runtime of model checking could be reduced. This, however, comes at the 
risk of producing spurious model checking results (i.e. a final state might be 
deemed reachable, although there is no corresponding reachable state in the 
timed automaton). 


5 Experimental Evaluation 


We evaluate the checker on a set of benchmarks that is derived from UPPAAL’s
standard benchmark suite [22]. Additionally, to cover the advanced modeling
features of committed locations and broadcast channels, we use a set of benchmarks
that is derived from the pacemaker models of Jiang et al. [17] and a
modified version of the FDDI benchmark with broadcast channels. A prototype
SML implementation of a timed automata model checker (Mlunta) is used to
compute the certificates. We use reachability properties of the form E◇ false to
enforce that the model checker explores the complete state space. The results are
given in Table 1. The problem size is specified as the number of automata in the
network. We report the total runtime (wall time) of:


1. the tandem consisting of Mlunta (using the first compression technique) and 
the (verified) certificate checker, both compiled with MLton; 

2. the individual runtime of the (verified) certificate checker for a varying number 
of threads for parallel computation, compiled with Poly/ML as it is the only 
SML compiler that supports multi-threading; 

3. the runtime of UPPAAL configured for depth-first search (like Mlunta); 

4. the runtime of an unverified SML implementation of the certificate checker 
based on Mlunta (compiled with MLton); 

5. and the runtime of the fully verified model checker Munta [31] extended with 
the improvements from sections 2 and 4 and compiled with MLton. 


As can be seen from the results, the tandem is still one order of magnitude
slower than UPPAAL, but certificate checking in isolation is also up to one order of
magnitude faster than the previous verified model checker [31]. Note that Mlunta
explores significantly more states than UPPAAL and Munta for “Pacemaker”.
Multi-core scaling beyond two threads is relatively unsatisfactory, however. In
micro-benchmarks, we have identified that the problem appears to be with memory
allocation on the heap, even if no data is shared among threads (in our case, only
the certificate is shared but successors are computed locally). There does not
seem to be an obvious way to improve on this situation for SML implementations.
Finally, one can see that the verified certifier is not drastically slower than the
unverified implementation based on Mlunta, indicating that the verified certifier
is not missing any obvious significant optimizations.


                                                    Certifier for #threads
Model       Size   UPPAAL   Tandem    Munta Unverif.  1-MLton        1        2        3        4
FDDI           8     0.33     0.79     1.01     0.14     0.21     0.99     0.64     0.57     0.53
              10     5.93     1.77     2.50     0.40     0.45     2.13     1.36     1.26     1.20
              12    92.66     3.93     5.42     0.90     1.20     3.41     2.33     2.25     2.18
              14  1874.28     7.28    10.73     1.86     2.22     5.36     3.94     3.90     3.88
              16      AKE    12.80    19.51     3.47     3.72    11.09     6.49     6.50     6.51
FDDI broad     8     0.34     0.28     1.07     0.10     0.08     0.26     0.19     0.17     0.16
Fischer        5     0.24     2.78     6.33     0.72     0.53     1.76     1.07     0.98     0.91
               6    34.74   143.72   377.70    40.99    26.58    40.60    25.47    24.16    23.67
CSMA           5     0.04     0.94     4.42     0.31     0.28     1.44     0.87     0.80     0.76
               6     1.53    13.48    65.16     5.24     4.18    12.16     8.04     7.87     7.76
            Mode
Pacemaker      1     0.02     0.16     0.37     0.03     0.03     0.25     0.16     0.14     0.13
               2     0.02     0.75     3.20     0.17     0.26     1.23     0.75     0.68     0.65
               3     0.03     1.39     4.23     0.34     0.46     2.53     1.55     1.40     1.31
               4     0.02    11.80     0.70     3.24     3.71    12.13     8.38     6.69     6.66
               5     0.02    30.84     0.86     9.13    10.07    26.60    18.58    18.15    17.95


Table 1: Benchmark results on a machine with 16 GB RAM and an Intel(R) Core(TM)
i7-4610M CPU at 3.00GHz with two cores and two threads per core. The column 
labeled “Tandem” gives the runtime for a combination of the unverified SML tool and 
the verified certificate checker. The next column gives the runtime of the unverified 
SML certifier, followed by the runtimes of the verified checker for a varying number of 
threads. All times are given in seconds. 


Table 2 gives the results of evaluating the different compression algorithms 
on the same set of benchmarks. The second variant is always applied to the 
compression result of the first variant to avoid trivial computations of the 
convex hull. Variant 2c) (the most expensive one) can produce drastically smaller 
certificates than any other variant, and its minimum compression factor is an
order of magnitude higher than for any other variant. Nevertheless, only variants
1 and 2a) appear to be useful in practice, as they are relatively cheap to compute. 
The other variants could prove useful if the certificates were produced by a 
significantly more efficient model checker, such as UPPAAL or TChecker [15]. On 
a final note, we have constructed a more than 95% smaller but valid certificate 
for the Fischer benchmark, suggesting that there is room for improvement on 
the compression algorithms. 

                                    Variant
Model           Size       1      2a      2b      2c      2d      2e
FDDI               8    0.21    0.21    1.72   69.53    3.65    3.43
FDDI broadcast     8    0.00   48.94   48.94   48.94    1.06    1.06
Fischer            5   22.03   22.03   22.72   43.06   30.40   30.40
CSMA/CD            5   26.06   41.54   43.84   81.16   58.94   47.54
                   6   24.86   41.91   44.02   88.35   63.24   47.02
                Mode
Pacemaker          1   16.07   25.00   30.80   58.04   29.02   29.02
                   2   24.00   26.38   30.37   58.68   35.87   35.22
                   3   12.96   17.62   19.23   46.92   25.30   25.01
                   4   13.82   20.02   23.60   41.48   26.16   24.71
                   5   17.14   22.48   25.46   39.69   28.18   26.88

Average                15.71   26.61   29.07   57.58   30.18   27.03


Table 2: Certificate compression factors (given in %). 


6 Conclusion and Future Work 


We have presented a verified certifier of unreachability certificates for timed
automata. The certificates are meant to be produced by an unverified model
checker. Experimentation shows that verified certificate checking in isolation is up 
to an order of magnitude faster than what was previously possible with a verified 
model checker [31]. The performance of a tandem of an unverified model checker 
and the verified certifier could be improved by replacing the certificate-producing 
part with a highly optimized tool, possibly opening room to use some of the more 
powerful certificate compression techniques we suggested above. As we pointed 
out above, there appears to be further room for improvement on the certificate 
compression algorithms as well. 

Moreover, more sophisticated tools also employ more powerful abstraction 
techniques, for which our proposed certification technique is still suitable—to a 
large extent without requiring additional verification effort. An exception is the 
implicit abstraction technique studied by Herbreteau et al. [14] as it does not 
compute abstractions of zones explicitly but rather checks subsumptions of the 
form $Z \subseteq a(Z')$ implicitly, meaning that one would have to prove correctness of
the subsumption check to validate certificates produced by such a model checking 
process. 

Finally, we intend to extend this work to certification of emptiness of timed 
Büchi automata in the future, using the idea of subsumption graphs [13] and 
relying on an unverified model checker implementation for timed Büchi automata 
to produce the certificates [13,18]. 


Data Availability Statement


The datasets generated and/or analyzed during the current study are available in 
the Zenodo repository [32]: https://doi.org/10.5281/zenodo.3679245. The artifact 
has been tested on the TACAS artifact evaluation VM [12]. 


Abstract. We present an algorithm for active learning of deterministic timed automata
with a single clock. The algorithm is within the framework of Angluin’s
L* algorithm and inspired by existing work on the active learning of symbolic
automata. Due to the need of guessing for each transition whether it resets the
clock, the algorithm is of exponential complexity in the size of the learned automata.
Before presenting this algorithm, we propose a simpler version where the
teacher is assumed to be smart in the sense of being able to provide the reset
information. We show that this simpler setting yields a polynomial complexity of
the learning process. Both of the algorithms are implemented and evaluated on
a collection of randomly generated examples. We furthermore demonstrate the
simpler algorithm on the functional specification of the TCP protocol.


Keywords: Automaton learning · Active learning · One-clock timed automata ·
Timed language · Reset-logical-timed language.


1 Introduction 


In her seminal work [10], Angluin introduced the L* algorithm for learning a regular
language from queries and counterexamples within a query-answering framework.
The Angluin-style learning therefore is also termed active learning or query learning,
which is distinguished from passive learning, i.e., generating a model from a given data
set. Following this line of research, an increasing number of efficient active learning
methods (cf. [38]) have been proposed to learn, e.g., Mealy machines [34,30], I/O automata
[2], register automata [25,1,15], nondeterministic finite